Xiaoming Fu

NI
h-index24
14papers
377citations
Novelty50%
AI Score56

14 Papers

CLOct 13, 2023Code
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Xiaocui Yang, Wenfang Wu, Shi Feng et al.

The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of diverse multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.

AIApr 16Code
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

Yi Zhao, Yajuan Peng, Cam-Tu Nguyen et al.

Large Reasoning Models (LRMs) achieve strong performance on complex tasks through extended chains of thought but suffer from high inference latency due to autoregressive reasoning. Recent work explores using Small Reasoning Models (SRMs) to accelerate LRM inference. In this paper, we systematically characterize the capability boundaries of SRMs and identify three common types of reasoning risks: (1) path divergence, where SRMs lack the strategic ability to construct an initial plan, causing reasoning to deviate from the most probable path; (2) cognitive overload, where SRMs fail to solve particularly difficult steps; and (3) recovery inability, where SRMs lack robust self-reflection and error correction mechanisms. To address these challenges, we propose TrigReason, a trigger-based collaborative reasoning framework that replaces continuous polling with selective intervention. TrigReason delegates most reasoning to the SRM and activates LRM intervention only when necessary-during initial strategic planning (strategic priming trigger), upon detecting extraordinary overconfidence (cognitive offload trigger), or when reasoning falls into unproductive loops (intervention request trigger). The evaluation results on AIME24, AIME25, and GPQA-D indicate that TrigReason matches the accuracy of full LRMs and SpecReason, while offloading 1.70x - 4.79x more reasoning steps to SRMs. Under edge-cloud conditions, TrigReason reduces latency by 43.9\% and API cost by 73.3\%. Our code is available at \href{https://github.com/QQQ-yi/TrigReason}{https://github.com/QQQ-yi/TrigReason}

CVJan 20, 2023
AccDecoder: Accelerated Decoding for Neural-enhanced Video Analytics

Tingting Yuan, Liang Mi, Weijun Wang et al. · apple-ml

The quality of the video stream is key to neural network-based video analytics. However, low-quality video is inevitably collected by existing surveillance systems because of poor quality cameras or over-compressed/pruned video streaming protocols, e.g., as a result of upstream bandwidth limit. To address this issue, existing studies use quality enhancers (e.g., neural super-resolution) to improve the quality of videos (e.g., resolution) and eventually ensure inference accuracy. Nevertheless, directly applying quality enhancers does not work in practice because it will introduce unacceptable latency. In this paper, we present AccDecoder, a novel accelerated decoder for real-time and neural-enhanced video analytics. AccDecoder can select a few frames adaptively via Deep Reinforcement Learning (DRL) to enhance the quality by neural super-resolution and then up-scale the unselected frames that reference them, which leads to 6-21% accuracy improvement. AccDecoder provides efficient inference capability via filtering important frames using DRL for DNN-based inference and reusing the results for the other frames via extracting the reference relationship among frames and blocks, which results in a latency reduction of 20-80% than baselines.

MADec 3, 2022
DACOM: Learning Delay-Aware Communication for Multi-Agent Reinforcement Learning

Tingting Yuan, Hwei-Ming Chung, Jie Yuan et al.

Communication is supposed to improve multi-agent collaboration and overall performance in cooperative Multi-agent reinforcement learning (MARL). However, such improvements are prevalently limited in practice since most existing communication schemes ignore communication overheads (e.g., communication delays). In this paper, we demonstrate that ignoring communication delays has detrimental effects on collaborations, especially in delay-sensitive tasks such as autonomous driving. To mitigate this impact, we design a delay-aware multi-agent communication model (DACOM) to adapt communication to delays. Specifically, DACOM introduces a component, TimeNet, that is responsible for adjusting the waiting time of an agent to receive messages from other agents such that the uncertainty associated with delay can be addressed. Our experiments reveal that DACOM has a non-negligible performance improvement over other mechanisms by making a better trade-off between the benefits of communication and the costs of waiting for messages.

MAApr 26, 2022
PP-MARL: Efficient Privacy-Preserving Multi-Agent Reinforcement Learning for Cooperative Intelligence in Communications

Tingting Yuan, Hwei-Ming Chung, Xiaoming Fu

Cooperative intelligence (CI) is expected to become an integral element in next-generation networks because it can aggregate the capabilities and intelligence of multiple devices. Multi-agent reinforcement learning (MARL) is a popular approach for achieving CI in communication problems by enabling effective collaboration among agents to address sequential problems. However, ensuring privacy protection for MARL is a challenging task because of the presence of heterogeneous agents that learn interdependently via sharing information. Implementing privacy protection techniques such as data encryption and federated learning to MARL introduces the notable overheads (e.g., computation and bandwidth). To overcome these challenges, we propose PP-MARL, an efficient privacy-preserving learning scheme for MARL. PP-MARL leverages homomorphic encryption (HE) and differential privacy (DP) to protect privacy, while introducing split learning to decrease overheads via reducing the volume of shared messages, and then improve efficiency. We apply and evaluate PP-MARL in two communication-related use cases. Simulation results reveal that PP-MARL can achieve efficient and reliable collaboration with 1.1-6 times better privacy protection and lower overheads (e.g., 84-91% reduction in bandwidth) than state-of-the-art approaches.

NIMay 4
Choir: Tackling RTBC Performance Impossible Triangle with 5G Collaboration

Wenji Du, Wanghong Yang, Baosen Zhao et al.

Real-time broadband communication (RTBC) scenarios, such as cloud virtual reality and 8K live streaming, further raise the criteria of the performance triangle, requiring video bitrates exceeding 30 Mbps, tail delay below 50 ms, and fairness guarantees for multi-user concurrent access. Based on our testing and analysis, existing RTBC-oriented rate control solutions, including end-to-end algorithms and network-assisted algorithms, fail to simultaneously satisfy all performance metrics. The native dynamic delay and physical-layer resource allocation strategy inherent to the 5G radio access network (RAN) are the key reasons. These solutions lack adaptation to the 5G architecture, leading to reduced decision performance. This paper proposes Choir, an innovative collaborative solution mainly deployed on 5G base stations that deeply integrates 5G radio characteristics and video streaming traffic patterns to guide efficient sender-side rate control. Extensive simulation and testbed evaluations demonstrate Choir's significant performance in achieving high average bitrate, low tail delay, and inter-flow fairness across different 5G network scenarios.

NIDec 25, 2023
BiSwift: Bandwidth Orchestrator for Multi-Stream Video Analytics on Edge

Lin Sun, Weijun Wang, Tingting Yuan et al.

High-definition (HD) cameras for surveillance and road traffic have experienced tremendous growth, demanding intensive computation resources for real-time analytics. Recently, offloading frames from the front-end device to the back-end edge server has shown great promise. In multi-stream competitive environments, efficient bandwidth management and proper scheduling are crucial to ensure both high inference accuracy and high throughput. To achieve this goal, we propose BiSwift, a bi-level framework that scales the concurrent real-time video analytics by a novel adaptive hybrid codec integrated with multi-level pipelines, and a global bandwidth controller for multiple video streams. The lower-level front-back-end collaborative mechanism (called adaptive hybrid codec) locally optimizes the accuracy and accelerates end-to-end video analytics for a single stream. The upper-level scheduler aims to accuracy fairness among multiple streams via the global bandwidth controller. The evaluation of BiSwift shows that BiSwift is able to real-time object detection on 9 streams with an edge device only equipped with an NVIDIA RTX3070 (8G) GPU. BiSwift improves 10%$\sim$21% accuracy and presents 1.2$\sim$9$\times$ throughput compared with the state-of-the-art video analytics pipelines.

GRMar 16
Masked BRep Autoencoder via Hierarchical Graph Transformer

Yifei Li, Kang Wu, Wenming Wu et al.

We introduce a novel self-supervised learning framework that automatically learns representations from input computer-aided design (CAD) models for downstream tasks, including part classification, modeling segmentation, and machining feature recognition. To train our network, we construct a large-scale, unlabeled dataset of boundary representation (BRep) models. The success of our algorithm relies on two keycomponents. The first is a masked graph autoencoder that reconstructs randomly masked geometries and attributes of BReps for representation learning to enhance the generalization. The second is a hierarchical graph Transformer architecture that elegantly fuses global and local learning by a cross-scale mutual attention block to model long-range geometric dependencies and a graph neural network block to aggregate local topological information. After training the autoencoder, we replace its decoder with a task-specific network trained on a small amount of labeled data for downstream tasks. We conduct experiments on various tasks and achieve high performance, even with a small amount of labeled data, demonstrating the practicality and generalizability of our model. Compared to other methods, our model performs significantly better on downstream tasks with the same amount of training data, particularly when the training data is very limited.

LGAug 3, 2025
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

Yi Zhao, Yajuan Peng, Cam-Tu Nguyen et al.

KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens equally, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs of different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger model in perceiving globally important information of attention; and 2) use the smaller model's attention scores to approximate those of marginal tokens in the larger model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency evaluations show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods, highlighting its potential for efficient and performant LLM inference in resource constrained environments.

CVSep 12, 2025
SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

Wenfang Wu, Tingting Yuan, Yupeng Li et al.

Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.

LGDec 18, 2021
GNN-Geo: A Graph Neural Network-based Fine-grained IP geolocation Framework

Shichang Ding, Xiangyang Luo, Jinwei Wang et al.

Rule-based fine-grained IP geolocation methods are hard to generalize in computer networks which do not follow hypothetical rules. Recently, deep learning methods, like multi-layer perceptron (MLP), are tried to increase generalization capabilities. However, MLP is not so suitable for graph-structured data like networks. MLP treats IP addresses as isolated instances and ignores the connection information, which limits geolocation accuracy. In this work, we research how to increase the generalization capability with an emerging graph deep learning method -- Graph Neural Network (GNN). First, IP geolocation is re-formulated as an attributed graph node regression problem. Then, we propose a GNN-based IP geolocation framework named GNN-Geo. GNN-Geo consists of a preprocessor, an encoder, messaging passing (MP) layers and a decoder. The preprocessor and encoder transform measurement data into the initial node embeddings. MP layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to nodes' locations and relieves the convergence problem by considering prior knowledge. The experiments in 8 real-world IPv4/IPv6 networks in North America, Europe and Asia show the proposed GNN-Geo clearly outperforms the state-of-art rule-based and learning-based baselines. This work verifies the great potential of GNN for fine-grained IP geolocation.

NIApr 2, 2020
RACE: Reinforced Cooperative Autonomous Vehicle Collision AvoidancE

Yali Yuan, Robert Tasik, Sripriya Srikant Adhatarao et al.

With the rapid development of autonomous driving, collision avoidance has attracted attention from both academia and industry. Many collision avoidance strategies have emerged in recent years, but the dynamic and complex nature of driving environment poses a challenge to develop robust collision avoidance algorithms. Therefore, in this paper, we propose a decentralized framework named RACE: Reinforced Cooperative Autonomous Vehicle Collision AvoidancE. Leveraging a hierarchical architecture we develop an algorithm named Co-DDPG to efficiently train autonomous vehicles. Through a security abiding channel, the autonomous vehicles distribute their driving policies. We use the relative distances obtained by the opponent sensors to build the VANET instead of locations, which ensures the vehicle's location privacy. With a leader-follower architecture and parameter distribution, RACE accelerates the learning of optimal policies and efficiently utilizes the remaining resources. We implement the RACE framework in the widely used TORCS simulator and conduct various experiments to measure the performance of RACE. Evaluations show that RACE quickly learns optimal driving policies and effectively avoids collisions. Moreover, RACE also scales smoothly with varying number of participating vehicles. We further compared RACE with existing autonomous driving systems and show that RACE outperforms them by experiencing 65% less collisions in the training process and exhibits improved performance under varying vehicle density.

CYFeb 21, 2017
Trajectory Recovery From Ash: User Privacy Is NOT Preserved in Aggregated Mobility Data

Fengli Xu, Zhen Tu, Yong Li et al.

Human mobility data has been ubiquitously collected through cellular networks and mobile applications, and publicly released for academic research and commercial purposes for the last decade. Since releasing individual's mobility records usually gives rise to privacy issues, datasets owners tend to only publish aggregated mobility data, such as the number of users covered by a cellular tower at a specific timestamp, which is believed to be sufficient for preserving users' privacy. However, in this paper, we argue and prove that even publishing aggregated mobility data could lead to privacy breach in individuals' trajectories. We develop an attack system that is able to exploit the uniqueness and regularity of human mobility to recover individual's trajectories from the aggregated mobility data without any prior knowledge. By conducting experiments on two real-world datasets collected from both mobile application and cellular network, we reveal that the attack system is able to recover users' trajectories with accuracy about 73%~91% at the scale of tens of thousands to hundreds of thousands users, which indicates severe privacy leakage in such datasets. Through the investigation on aggregated mobility data, our work recognizes a novel privacy problem in publishing statistic data, which appeals for immediate attentions from both academy and industry.

HCMar 17, 2015
Recent Advances and Challenges in Ubiquitous Sensing

Stephan Sigg, Kai Kunze, Xiaoming Fu

Ubiquitous sensing is tightly coupled with activity recognition. This survey reviews recent advances in Ubiquitous sensing and looks ahead on promising future directions. In particular, Ubiquitous sensing crosses new barriers giving us new ways to interact with the environment or to inspect our psyche. Through sensing paradigms that parasitically utilise stimuli from the noise of environmental, third-party pre-installed systems, sensing leaves the boundaries of the personal domain. Compared to previous environmental sensing approaches, these new systems mitigate high installation and placement cost by providing a robustness towards process noise. On the other hand, sensing focuses inward and attempts to capture mental activities such as cognitive load, fatigue or emotion through advances in, for instance, eye-gaze sensing systems or interpretation of body gesture or pose. This survey summarises these developments and discusses current research questions and promising future directions.