Bruce X. B. Yu

CV
h-index28
8papers
292citations
Novelty46%
AI Score42

8 Papers

CVJul 12, 2023Code
GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video

Bruce X. B. Yu, Zhi Zhang, Yongxu Liu et al.

3D human pose estimation has been researched for decades with promising fruits. 3D human pose lifting is one of the promising research directions toward the task where both estimated pose and ground truth pose data are used for training. Existing pose lifting works mainly focus on improving the performance of estimated pose, but they usually underperform when testing on the ground truth pose data. We observe that the performance of the estimated pose can be easily improved by preparing good quality 2D pose, such as fine-tuning the 2D pose or using advanced 2D pose detectors. As such, we concentrate on improving the 3D human pose lifting via ground truth data for the future improvement of more quality estimated pose data. Towards this goal, a simple yet effective model called Global-local Adaptive Graph Convolutional Network (GLA-GCN) is proposed in this work. Our GLA-GCN globally models the spatiotemporal structure via a graph representation and backtraces local joint features for 3D human pose estimation via individually connected layers. To validate our model design, we conduct extensive experiments on three benchmark datasets: Human3.6M, HumanEva-I, and MPI-INF-3DHP. Experimental results show that our GLA-GCN implemented with ground truth 2D poses significantly outperforms state-of-the-art methods (e.g., up to around 3%, 17%, and 14% error reductions on Human3.6M, HumanEva-I, and MPI-INF-3DHP, respectively). GitHub: https://github.com/bruceyo/GLA-GCN.

CVOct 3, 2022Code
Towards a Unified View on Visual Parameter-Efficient Transfer Learning

Bruce X. B. Yu, Jianlong Chang, Lingbo Liu et al.

Parameter efficient transfer learning (PETL) aims at making good use of the representation knowledge in the pre-trained large models by fine-tuning a small number of parameters. Recently, taking inspiration from the natural language processing (NLP) domain, popular PETL techniques such as prompt-tuning and Adapter have also been successfully applied to the vision domain. However, prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large vision models (LVMs) to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL called visual-PETL (V-PETL) to investigate the effects of different PETL techniques, data scales of downstream domains, positions of trainable parameters, and other aspects affecting the trade-off. Specifically, we analyze the positional importance of trainable parameters and differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially for the under-explored prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and vision data, we propose a new variation of the prefix-tuning module called parallel attention (PATT) for vision downstream tasks. An extensive empirical analysis on vision tasks via different frozen LVMs has been carried and the findings show that the proposed PATT can effectively contribute to other PETL techniques. An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters and outperforms full-tuning with far fewer parameters. Code and data are available at: https://github.com/bruceyo/V-PETL.

CVAug 22, 2022
Prompt-Matched Semantic Segmentation

Lingbo Liu, Jianlong Chang, Bruce X. B. Yu et al.

The objective of this work is to explore how to effectively and efficiently adapt pre-trained visual foundation models to various downstream tasks of semantic segmentation. Previous methods usually fine-tuned the entire networks for each specific dataset, which will be burdensome to store massive parameters of these networks. A few recent works attempted to insert some extra trainable parameters into the frozen networks to learn visual prompts for parameter-efficient tuning. However, these works showed poor generality as they were designed specifically for Transformers. Moreover, using limited information in these schemes, they exhibited a poor capacity to learn beneficial prompts. To alleviate these issues, we propose a novel Stage-wise Prompt-Matched Framework for generic and effective visual prompt tuning. Specifically, to ensure generality, we divide the pre-trained backbone with frozen parameters into multiple stages and perform prompt learning between different stages, which makes the proposed scheme applicable to various architectures of CNN and Transformer. For effective tuning, a lightweight Semantic-aware Prompt Matcher (SPM) is designed to progressively learn reasonable prompts with a recurrent mechanism, guided by the rich information of interim semantic maps. Working as deep matched filter of representation learning, the proposed SPM can well transform the output of the previous stage into a desirable input for the next stage, thus achieving the better matching/stimulating for the pre-trained knowledge. Extensive experiments on four benchmarks demonstrate that the proposed scheme can achieve a promising trade-off between parameter efficiency and performance effectiveness. Our code and models will be released.

SDAug 27, 2025
CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation

Zhejing Hu, Yan Liu, Gong Chen et al.

Generative artificial intelligence in music has made significant strides, yet it still falls short of the substantial achievements seen in natural language processing, primarily due to the limited availability of music data. Knowledge-informed approaches have been shown to enhance the performance of music generation models, even when only a few pieces of musical knowledge are integrated. This paper seeks to leverage comprehensive music theory in AI-driven music generation tasks, such as algorithmic composition and style transfer, which traditionally require significant manual effort with existing techniques. We introduce a novel automatic music lexicon construction model that generates a lexicon, named CompLex, comprising 37,432 items derived from just 9 manually input category keywords and 5 sentence prompt templates. A new multi-agent algorithm is proposed to automatically detect and mitigate hallucinations. CompLex demonstrates impressive performance improvements across three state-of-the-art text-to-music generation models, encompassing both symbolic and audio-based methods. Furthermore, we evaluate CompLex in terms of completeness, accuracy, non-redundancy, and executability, confirming that it possesses the key characteristics of an effective lexicon.

LGAug 15, 2025
PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding

Changhong Jing, Yan Liu, Shuqiang Wang et al.

Cross-subject electroencephalography (EEG) decoding remains a fundamental challenge in brain-computer interface (BCI) research due to substantial inter-subject variability and the scarcity of subject-invariant representations. This paper proposed PTSM (Physiology-aware and Task-invariant Spatio-temporal Modeling), a novel framework for interpretable and robust EEG decoding across unseen subjects. PTSM employs a dual-branch masking mechanism that independently learns personalized and shared spatio-temporal patterns, enabling the model to preserve individual-specific neural characteristics while extracting task-relevant, population-shared features. The masks are factorized across temporal and spatial dimensions, allowing fine-grained modulation of dynamic EEG patterns with low computational overhead. To further address representational entanglement, PTSM enforces information-theoretic constraints that decompose latent embeddings into orthogonal task-related and subject-related subspaces. The model is trained end-to-end via a multi-objective loss integrating classification, contrastive, and disentanglement objectives. Extensive experiments on cross-subject motor imagery datasets demonstrate that PTSM achieves strong zero-shot generalization, outperforming state-of-the-art baselines without subject-specific calibration. Results highlight the efficacy of disentangled neural representations for achieving both personalized and transferable decoding in non-stationary neurophysiological settings.

CVMay 10, 2023
Visual Tuning

Bruce X. B. Yu, Jianlong Chang, Haixin Wang et al.

Fine-tuning visual models has been widely shown promising performance on many downstream visual tasks. With the surprising development of pre-trained visual foundation models, visual tuning jumped out of the standard modus operandi that fine-tunes the whole pre-trained model or just the fully connected layer. Instead, recent advances can achieve superior performance than full-tuning the whole pre-trained parameters by updating far fewer parameters, enabling edge devices and downstream applications to reuse the increasingly large foundation models deployed on the cloud. With the aim of helping researchers get the full picture and future directions of visual tuning, this survey characterizes a large and thoughtful selection of recent works, providing a systematic and comprehensive overview of existing work and models. Specifically, it provides a detailed background of visual tuning and categorizes recent visual tuning techniques into five groups: prompt tuning, adapter tuning, parameter tuning, and remapping tuning. Meanwhile, it offers some exciting research directions for prospective pre-training and various interactions in visual tuning.

CVApr 29, 2020
Skeleton Focused Human Activity Recognition in RGB Video

Bruce X. B. Yu, Yan Liu, Keith C. C. Chan

The data-driven approach that learns an optimal representation of vision features like skeleton frames or RGB videos is currently a dominant paradigm for activity recognition. While great improvements have been achieved from existing single modal approaches with increasingly larger datasets, the fusion of various data modalities at the feature level has seldom been attempted. In this paper, we propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity. The objective is to improve the activity recognition accuracy by effectively utilizing the mutual complemental information among different data modalities. For the skeleton modality, we propose to use a graph convolutional subnetwork to learn the skeleton representation. Whereas for the RGB modality, we will use the spatial-temporal region of interest from RGB videos and take the attention features from the skeleton modality to guide the learning process. The model could be either individually or uniformly trained by the back-propagation algorithm in an end-to-end manner. The experimental results for the NTU-RGB+D and Northwestern-UCLA Multiview datasets achieved state-of-the-art performance, which indicates that the proposed skeleton-driven attention mechanism for the RGB modality increases the mutual communication between different data modalities and brings more discriminative features for inferring human activities.

CVApr 29, 2020
Effective Human Activity Recognition Based on Small Datasets

Bruce X. B. Yu, Yan Liu, Keith C. C. Chan

Most recent work on vision-based human activity recognition (HAR) focuses on designing complex deep learning models for the task. In so doing, there is a requirement for large datasets to be collected. As acquiring and processing large training datasets are usually very expensive, the problem of how dataset size can be reduced without affecting recognition accuracy has to be tackled. To do so, we propose a HAR method that consists of three steps: (i) data transformation involving the generation of new features based on transforming of raw data, (ii) feature extraction involving the learning of a classifier based on the AdaBoost algorithm and the use of training data consisting of the transformed features, and (iii) parameter determination and pattern recognition involving the determination of parameters based on the features generated in (ii) and the use of the parameters as training data for deep learning algorithms to be used to recognize human activities. Compared to existing approaches, this proposed approach has the advantageous characteristics that it is simple and robust. The proposed approach has been tested with a number of experiments performed on a relatively small real dataset. The experimental results indicate that using the proposed method, human activities can be more accurately recognized even with smaller training data size.