Jie Xu

h-index24

8papers

155citations

Novelty28%

AI Score31

Ranked #133,487 of 194,257 authors (top 69%)#29,364 in LG (top 73%)

8 Papers

3.6ROOct 18, 2023

InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions

Hanbo Zhang, Jie Xu, Yuchen Mo et al.

Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present a large-scale dataset, \invig, for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the \invig dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6\% success rate during validation. To the best of our knowledge, the \invig dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Codes and datasets are available at: \href{https://openivg.github.io}{https://openivg.github.io}.

3.6IVSep 20, 2024

GASA-UNet: Global Axial Self-Attention U-Net for 3D Medical Image Segmentation

Chengkun Sun, Russell Stevens Terry, Jiang Bian et al.

Accurate segmentation of multiple organs and the differentiation of pathological tissues in medical imaging are crucial but challenging, especially for nuanced classifications and ambiguous organ boundaries. To tackle these challenges, we introduce GASA-UNet, a refined U-Net-like model featuring a novel Global Axial Self-Attention (GASA) block. This block processes image data as a 3D entity, with each 2D plane representing a different anatomical cross-section. Voxel features are defined within this spatial context, and a Multi-Head Self-Attention (MHSA) mechanism is utilized on extracted 1D patches to facilitate connections across these planes. Positional embeddings (PE) are incorporated into our attention framework, enriching voxel features with spatial context and enhancing tissue classification and organ edge delineation. Our model has demonstrated promising improvements in segmentation performance, particularly for smaller anatomical structures, as evidenced by enhanced Dice scores and Normalized Surface Dice (NSD) on three benchmark datasets, i.e., BTCV, AMOS, and KiTS23.

6.5CVSep 19, 2024

BGDB: Bernoulli-Gaussian Decision Block with Improved Denoising Diffusion Probabilistic Models

Chengkun Sun, Jinqian Pan, Russell Stevens Terry et al.

Generative models can enhance discriminative classifiers by constructing complex feature spaces, thereby improving performance on intricate datasets. Conventional methods typically augment datasets with more detailed feature representations or increase dimensionality to make nonlinear data linearly separable. Utilizing a generative model solely for feature space processing falls short of unlocking its full potential within a classifier and typically lacks a solid theoretical foundation. We base our approach on a novel hypothesis: the probability information (logit) derived from a single model training can be used to generate the equivalent of multiple training sessions. Leveraging the central limit theorem, this synthesized probability information is anticipated to converge toward the true probability more accurately. To achieve this goal, we propose the Bernoulli-Gaussian Decision Block (BGDB), a novel module inspired by the Central Limit Theorem and the concept that the mean of multiple Bernoulli trials approximates the probability of success in a single trial. Specifically, we utilize Improved Denoising Diffusion Probabilistic Models (IDDPM) to model the probability of Bernoulli Trials. Our approach shifts the focus from reconstructing features to reconstructing logits, transforming the logit from a single iteration into logits analogous to those from multiple experiments. We provide the theoretical foundations of our approach through mathematical analysis and validate its effectiveness through experimental evaluation using various datasets for multiple imaging tasks, including both classification and segmentation.

22.0LGApr 29, 2025

A Survey on Parameter-Efficient Fine-Tuning for Foundation Models in Federated Learning

Jieming Bian, Yuanzhe Peng, Lei Wang et al.

Foundation models have revolutionized artificial intelligence by providing robust, versatile architectures pre-trained on large-scale datasets. However, adapting these massive models to specific downstream tasks requires fine-tuning, which can be prohibitively expensive in computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge by selectively updating only a small subset of parameters. Meanwhile, Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, making it ideal for privacy-sensitive applications. This survey provides a comprehensive review of the integration of PEFT techniques within federated learning environments. We systematically categorize existing approaches into three main groups: Additive PEFT (which introduces new trainable parameters), Selective PEFT (which fine-tunes only subsets of existing parameters), and Reparameterized PEFT (which transforms model architectures to enable efficient updates). For each category, we analyze how these methods address the unique challenges of federated settings, including data heterogeneity, communication efficiency, computational constraints, and privacy concerns. We further organize the literature based on application domains, covering both natural language processing and computer vision tasks. Finally, we discuss promising research directions, including scaling to larger foundation models, theoretical analysis of federated PEFT methods, and sustainable approaches for resource-constrained environments.

9.4LGSep 18, 2025

Generative AI Meets Wireless Sensing: Towards Wireless Foundation Model

Zheng Yang, Guoxuan Chi, Chenshu Wu et al. · tsinghua

Generative Artificial Intelligence (GenAI) has made significant advancements in fields such as computer vision (CV) and natural language processing (NLP), demonstrating its capability to synthesize high-fidelity data and improve generalization. Recently, there has been growing interest in integrating GenAI into wireless sensing systems. By leveraging generative techniques such as data augmentation, domain adaptation, and denoising, wireless sensing applications, including device localization, human activity recognition, and environmental monitoring, can be significantly improved. This survey investigates the convergence of GenAI and wireless sensing from two complementary perspectives. First, we explore how GenAI can be integrated into wireless sensing pipelines, focusing on two modes of integration: as a plugin to augment task-specific models and as a solver to directly address sensing tasks. Second, we analyze the characteristics of mainstream generative models, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, and discuss their applicability and unique advantages across various wireless sensing tasks. We further identify key challenges in applying GenAI to wireless sensing and outline a future direction toward a wireless foundation model: a unified, pre-trained design capable of scalable, adaptable, and efficient signal understanding across diverse sensing tasks.

4.1LGMay 27, 2025

Multimodal Federated Learning: A Survey through the Lens of Different FL Paradigms

Yuanzhe Peng, Jieming Bian, Lei Wang et al.

Multimodal Federated Learning (MFL) lies at the intersection of two pivotal research areas: leveraging complementary information from multiple modalities to improve downstream inference performance and enabling distributed training to enhance efficiency and preserve privacy. Despite the growing interest in MFL, there is currently no comprehensive taxonomy that organizes MFL through the lens of different Federated Learning (FL) paradigms. This perspective is important because multimodal data introduces distinct challenges across various FL settings. These challenges, including modality heterogeneity, privacy heterogeneity, and communication inefficiency, are fundamentally different from those encountered in traditional unimodal or non-FL scenarios. In this paper, we systematically examine MFL within the context of three major FL paradigms: horizontal FL (HFL), vertical FL (VFL), and hybrid FL. For each paradigm, we present the problem formulation, review representative training algorithms, and highlight the most prominent challenge introduced by multimodal data in distributed settings. We also discuss open challenges and provide insights for future research. By establishing this taxonomy, we aim to uncover the novel challenges posed by multimodal data from the perspective of different FL paradigms and to offer a new lens through which to understand and advance the development of MFL.

8.6HCDec 5, 2021

Improving Intention Detection in Single-Trial Classification through Fusion of EEG and Eye-tracker Data

Xianliang Ge, Yunxian Pan, Sujie Wang et al.

Intention decoding is an indispensable procedure in hands-free human-computer interaction (HCI). Conventional eye-tracking system using single-model fixation duration possibly issues commands ignoring users' real expectation. In the current study, an eye-brain hybrid brain-computer interface (BCI) interaction system was introduced for intention detection through fusion of multi-modal eye-track and ERP (a measurement derived from EEG) features. Eye-track and EEG data were recorded from 64 healthy participants as they performed a 40-min customized free search task of a fixed target icon among 25 icons. The corresponding fixation duration of eye-tracking and ERP were extracted. Five previously-validated LDA-based classifiers (including RLDA, SWLDA, BLDA, SKLDA, and STDA) and the widely-used CNN method were adopted to verify the efficacy of feature fusion from both offline and pseudo-online analysis, and optimal approach was evaluated through modulating the training set and system response duration. Our study demonstrated that the input of multi-modal eye-track and ERP features achieved superior performance of intention detection in the single trial classification of active search task. And compared with single-model ERP feature, this new strategy also induced congruent accuracy across different classifiers. Moreover, in comparison with other classification methods, we found that the SKLDA exhibited the superior performance when fusing feature in offline test (ACC=0.8783, AUC=0.9004) and online simulation with different sample amount and duration length. In sum, the current study revealed a novel and effective approach for intention classification using eye-brain hybrid BCI, and further supported the real-life application of hands-free HCI in a more precise and stable manner.

36.1LGNov 13, 2019

Federated Learning for Healthcare Informatics

Jie Xu, Benjamin S. Glicksberg, Chang Su et al.

With the rapid development of computer software and hardware technologies, more and more healthcare data are becoming readily available from clinical institutions, patients, insurance companies and pharmaceutical industries, among others. This access provides an unprecedented opportunity for data science technologies to derive data-driven insights and improve the quality of care delivery. Healthcare data, however, are usually fragmented and private making it difficult to generate robust results across populations. For example, different hospitals own the electronic health records (EHR) of different patient populations and these records are difficult to share across hospitals because of their sensitive nature. This creates a big barrier for developing effective analytical approaches that are generalizable, which need diverse, "big data". Federated learning, a mechanism of training a shared global model with a central server while keeping all the sensitive data in local institutions where the data belong, provides great promise to connect the fragmented healthcare data sources with privacy-preservation. The goal of this survey is to provide a review for federated learning technologies, particularly within the biomedical space. In particular, we summarize the general solutions to the statistical challenges, system challenges and privacy issues in federated learning, and point out the implications and potentials in healthcare.