Philip Ogunbona

CV
h-index38
19papers
2,093citations
Novelty46%
AI Score33

19 Papers

CVAug 18, 2024
Joint Temporal Pooling for Improving Skeleton-based Action Recognition

Shanaka Ramesh Gunasekara, Wanqing Li, Jack Yang et al.

In skeleton-based human action recognition, temporal pooling is a critical step for capturing spatiotemporal relationship of joint dynamics. Conventional pooling methods overlook the preservation of motion information and treat each frame equally. However, in an action sequence, only a few segments of frames carry discriminative information related to the action. This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition. Two variants of JMAP, frame-wise pooling and joint-wise pooling, are introduced. The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.

CVMay 29, 2025
Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

Shanaka Ramesh Gunasekara, Wanqing Li, Philip Ogunbona et al.

Traditional approaches in unsupervised or self supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet, the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints termed "prime joints" to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and nonprime joints. In addition, a method called STJD-MP is developed by integrating it with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improved performance, particularly by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.

CVMay 26, 2020
Unsupervised Domain Expansion from Multiple Sources

Jing Zhang, Wanqing Li, Lu sheng et al.

Given an existing system learned from previous source domains, it is desirable to adapt the system to new domains without accessing and forgetting all the previous domains in some applications. This problem is known as domain expansion. Unlike traditional domain adaptation in which the target domain is the domain defined by new data, in domain expansion the target domain is formed jointly by the source domains and the new domain (hence, domain expansion) and the label function to be learned must work for the expanded domain. Specifically, this paper presents a method for unsupervised multi-source domain expansion (UMSDE) where only the pre-learned models of the source domains and unlabelled new domain data are available. We propose to use the predicted class probability of the unlabelled data in the new domain produced by different source models to jointly mitigate the biases among domains, exploit the discriminative information in the new domain, and preserve the performance in the source domains. Experimental results on the VLCS, ImageCLEF_DA and PACS datasets have verified the effectiveness of the proposed method.

CVMar 25, 2018
Importance Weighted Adversarial Nets for Partial Domain Adaptation

Jing Zhang, Zewei Ding, Wanqing Li et al.

This paper proposes an importance weighted adversarial nets-based method for unsupervised domain adaptation, specific for partial domain adaptation where the target domain has less number of classes compared to the source domain. Previous domain adaptation methods generally assume the identical label spaces, such that reducing the distribution divergence leads to feasible knowledge transfer. However, such an assumption is no longer valid in a more realistic scenario that requires adaptation from a larger and more diverse source domain to a smaller target domain with less number of classes. This paper extends the adversarial nets-based domain adaptation and proposes a novel adversarial nets-based partial domain adaptation method to identify the source samples that are potentially from the outlier classes and, at the same time, reduce the shift of shared classes between domains.

CVMar 25, 2018
Unsupervised Domain Adaptation: A Multi-task Learning-based Method

Jing Zhang, Wanqing Li, Philip Ogunbona

This paper presents a novel multi-task learning-based method for unsupervised domain adaptation. Specifically, the source and target domain classifiers are jointly learned by considering the geometry of target domain and the divergence between the source and target domains based on the concept of multi-task learning. Two novel algorithms are proposed upon the method using Regularized Least Squares and Support Vector Machines respectively. Experiments on both synthetic and real world cross domain recognition tasks have shown that the proposed methods outperform several state-of-the-art domain adaptation methods.

CVMar 17, 2018
Depth Pooling Based Large-scale 3D Action Recognition with Convolutional Neural Networks

Pichao Wang, Wanqing Li, Zhimin Gao et al.

This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI), for both isolated and continuous action recognition. These dynamic images are constructed from a segmented sequence of depth maps using hierarchical bidirectional rank pooling to effectively capture the spatial-temporal information. Specifically, DDI exploits the dynamics of postures over time and DDNI and DDMNI exploit the 3D structural information captured by depth maps. Upon the proposed representations, a ConvNet based method is developed for action recognition. The image-based representations enable us to fine-tune the existing Convolutional Neural Network (ConvNet) models trained on image data without training a large number of parameters from scratch. The proposed method achieved the state-of-art results on three large datasets, namely, the Large-scale Continuous Gesture Recognition Dataset (means Jaccard index 0.4109), the Large-scale Isolated Gesture Recognition Dataset (59.21%), and the NTU RGB+D Dataset (87.08% cross-subject and 84.22% cross-view) even though only the depth modality was used.

CVDec 5, 2017
Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition

Pichao Wang, Wanqing Li, Jun Wan et al.

A novel deep neural network training paradigm that exploits the conjoint information in multiple heterogeneous sources is proposed. Specifically, in a RGB-D based action recognition task, it cooperatively trains a single convolutional neural network (named c-ConvNet) on both RGB visual features and depth features, and deeply aggregates the two kinds of features for action recognition. Differently from the conventional ConvNet that learns the deep separable features for homogeneous modality-based classification with only one softmax loss function, the c-ConvNet enhances the discriminative power of the deeply learned features and weakens the undesired modality discrepancy by jointly optimizing a ranking loss and a softmax loss for both homogeneous and heterogeneous modalities. The ranking loss consists of intra-modality and cross-modality triplet losses, and it reduces both the intra-modality and cross-modality feature variations. Furthermore, the correlations between RGB and depth data are embedded in the c-ConvNet, and can be retrieved by either of the modalities and contribute to the recognition in the case even only one of the modalities is available. The proposed method was extensively evaluated on two large RGB-D action recognition datasets, ChaLearn LAP IsoGD and NTU RGB+D datasets, and one small dataset, SYSU 3D HOI, and achieved state-of-the-art results.

CVOct 31, 2017
RGB-D-based Human Motion Recognition with Deep Learning: A Survey

Pichao Wang, Wanqing Li, Philip Ogunbona et al.

Human motion recognition is one of the most important branches of human-centered research activities. In recent years, motion recognition based on RGB-D data has attracted much attention. Along with the development in artificial intelligence, deep learning techniques have gained remarkable success in computer vision. In particular, convolutional neural networks (CNN) have achieved great success for image-based tasks, and recurrent neural networks (RNN) are renowned for sequence-based problems. Specifically, deep learning methods based on the CNN and RNN architectures have been adopted for motion recognition using RGB-D data. In this paper, a detailed overview of recent advances in RGB-D-based motion recognition is presented. The reviewed methods are broadly categorized into four groups, depending on the modality adopted for recognition: RGB-based, depth-based, skeleton-based and RGB+D-based. As a survey focused on the application of deep learning to RGB-D-based motion recognition, we explicitly discuss the advantages and limitations of existing techniques. Particularly, we highlighted the methods of encoding spatial-temporal-structural information inherent in video sequence, and discuss potential directions for future research.

CVMay 16, 2017
Joint Geometrical and Statistical Alignment for Visual Domain Adaptation

Jing Zhang, Wanqing Li, Philip Ogunbona

This paper presents a novel unsupervised domain adaptation method for cross-domain visual recognition. We propose a unified framework that reduces the shift between domains both statistically and geometrically, referred to as Joint Geometrical and Statistical Alignment (JGSA). Specifically, we learn two coupled projections that project the source domain and target domain data into low dimensional subspaces where the geometrical shift and distribution shift are reduced simultaneously. The objective function can be solved efficiently in a closed form. Extensive experiments have verified that the proposed method significantly outperforms several state-of-the-art domain adaptation methods on a synthetic dataset and three different real world cross-domain visual recognition tasks.

CVMay 11, 2017
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

Jing Zhang, Wanqing Li, Philip Ogunbona et al.

This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises the cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. Such a problem-oriented taxonomy has allowed us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. The comprehensive problem-oriented review of the advances in transfer learning with respect to the problem has not only revealed the challenges in transfer learning for visual recognition, but also the problems (e.g. eight of the seventeen problems) that have been scarcely studied. This survey not only presents an up-to-date technical review for researchers, but also a systematic approach and a reference for a machine learning practitioner to categorise a real problem and to look up for a possible solution accordingly.

CVFeb 28, 2017
Scene Flow to Action Map: A New Representation for RGB-D based Action Recognition with Convolutional Neural Networks

Pichao Wang, Wanqing Li, Zhimin Gao et al.

Scene flow describes the motion of 3D objects in real world and potentially could be the basis of a good feature for 3D action recognition. However, its use for action recognition, especially in the context of convolutional neural networks (ConvNets), has not been previously studied. In this paper, we propose the extraction and use of scene flow for action recognition from RGB-D data. Previous works have considered the depth and RGB modalities as separate channels and extract features for later fusion. We take a different approach and consider the modalities as one entity, thus allowing feature extraction for action recognition at the beginning. Two key questions about the use of scene flow for action recognition are addressed: how to organize the scene flow vectors and how to represent the long term dynamics of videos based on scene flow. In order to calculate the scene flow correctly on the available datasets, we propose an effective self-calibration method to align the RGB and depth data spatially without knowledge of the camera parameters. Based on the scene flow vectors, we propose a new representation, namely, Scene Flow to Action Map (SFAM), that describes several long term spatio-temporal dynamics for action recognition. We adopt a channel transform kernel to transform the scene flow vectors to an optimal color space analogous to RGB. This transformation takes better advantage of the trained ConvNets models over ImageNet. Experimental results indicate that this new representation can surpass the performance of state-of-the-art methods on two large public datasets.

CVJan 7, 2017
Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks

Pichao Wang, Wanqing Li, Song Liu et al.

This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI). These dynamic images are constructed from a sequence of depth maps using bidirectional rank pooling to effectively capture the spatial-temporal information. Such image-based representations enable us to fine-tune the existing ConvNets models trained on image data for classification of depth sequences, without introducing large parameters to learn. Upon the proposed representations, a convolutional Neural networks (ConvNets) based method is developed for gesture recognition and evaluated on the Large-scale Isolated Gesture Recognition at the ChaLearn Looking at People (LAP) challenge 2016. The method achieved 55.57\% classification accuracy and ranked $2^{nd}$ place in this challenge but was very close to the best performance even though we only used depth data.

CVAug 22, 2016
Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks

Pichao Wang, Wanqing Li, Song Liu et al.

This paper addresses the problem of continuous gesture recognition from sequences of depth maps using convolutional neutral networks (ConvNets). The proposed method first segments individual gestures from a depth sequence based on quantity of movement (QOM). For each segmented gesture, an Improved Depth Motion Map (IDMM), which converts the depth sequence into one image, is constructed and fed to a ConvNet for recognition. The IDMM effectively encodes both spatial and temporal information and allows the fine-tuning with existing ConvNet models for classification without introducing millions of parameters to learn. The proposed method is evaluated on the Large-scale Continuous Gesture Recognition of the ChaLearn Looking at People (LAP) challenge 2016. It achieved the performance of 0.2655 (Mean Jaccard Index) and ranked $3^{rd}$ place in this challenge.

CVApr 1, 2016
Learning a Pose Lexicon for Semantic Action Recognition

Lijuan Zhou, Wanqing Li, Philip Ogunbona

This paper presents a novel method for learning a pose lexicon comprising semantic poses defined by textual instructions and their associated visual poses defined by visual features. The proposed method simultaneously takes two input streams, semantic poses and visual pose candidates, and statistically learns a mapping between them to construct the lexicon. With the learned lexicon, action recognition can be cast as the problem of finding the maximum translation probability of a sequence of semantic poses given a stream of visual pose candidates. Experiments evaluating pre-trained and zero-shot action recognition conducted on MSRC-12 gesture and WorkoutSu-10 exercise datasets were used to verify the efficacy of the proposed method.

GRFeb 22, 2016
Creating Simplified 3D Models with High Quality Textures

Song Liu, Wanqing Li, Philip Ogunbona et al.

This paper presents an extension to the KinectFusion algorithm which allows creating simplified 3D models with high quality RGB textures. This is achieved through (i) creating model textures using images from an HD RGB camera that is calibrated with Kinect depth camera, (ii) using a modified scheme to update model textures in an asymmetrical colour volume that contains a higher number of voxels than that of the geometry volume, (iii) simplifying dense polygon mesh model using quadric-based mesh decimation algorithm, and (iv) creating and mapping 2D textures to every polygon in the output 3D model. The proposed method is implemented in real-time by means of GPU parallel processing. Visualization via ray casting of both geometry and colour volumes provides users with a real-time feedback of the currently scanned 3D model. Experimental results show that the proposed method is capable of keeping the model texture quality even for a heavily decimated model and that, when reconstructing small objects, photorealistic RGB textures can still be reconstructed.

CVJun 23, 2015
Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Luping Zhou, Lei Wang, Lingqiao Liu et al.

Due to its causal semantics, Bayesian networks (BN) have been widely employed to discover the underlying data relationship in exploratory studies, such as brain research. Despite its success in modeling the probability distribution of variables, BN is naturally a generative model, which is not necessarily discriminative. This may cause the ignorance of subtle but critical network changes that are of investigation values across populations. In this paper, we propose to improve the discriminative power of BN models for continuous variables from two different perspectives. This brings two general discriminative learning frameworks for Gaussian Bayesian networks (GBN). In the first framework, we employ Fisher kernel to bridge the generative models of GBN and the discriminative classifiers of SVMs, and convert the GBN parameter learning to Fisher kernel learning via minimizing a generalization error bound of SVMs. In the second framework, we employ the max-margin criterion and build it directly upon GBN models to explicitly optimize the classification performance of the GBNs. The advantages and disadvantages of the two frameworks are discussed and experimentally compared. Both of them demonstrate strong power in learning discriminative parameters of GBNs for neuroimaging based brain network analysis, as well as maintaining reasonable representation capacity. The contributions of this paper also include a new Directed Acyclic Graph (DAG) constraint with theoretical guarantee to ensure the graph validity of GBN.

CVJan 20, 2015
Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

Pichao Wang, Wanqing Li, Zhimin Gao et al.

Recently, deep learning approach has achieved promising results in various fields of computer vision. In this paper, a new framework called Hierarchical Depth Motion Maps (HDMM) + 3 Channel Deep Convolutional Neural Networks (3ConvNets) is proposed for human action recognition using depth map sequences. Firstly, we rotate the original depth data in 3D pointclouds to mimic the rotation of cameras, so that our algorithms can handle view variant cases. Secondly, in order to effectively extract the body shape and motion information, we generate weighted depth motion maps (DMM) at several temporal scales, referred to as Hierarchical Depth Motion Maps (HDMM). Then, three channels of ConvNets are trained on the HDMMs from three projected orthogonal planes separately. The proposed algorithms are evaluated on MSRAction3D, MSRAction3DExt, UTKinect-Action and MSRDailyActivity3D datasets respectively. We also combine the last three datasets into a larger one (called Combined Dataset) and test the proposed method on it. The results show that our approach can achieve state-of-the-art results on the individual datasets and without dramatical performance degradation on the Combined Dataset.

CVSep 14, 2014
Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation

Pichao Wang, Wanqing Li, Philip Ogunbona et al.

Recently, mid-level features have shown promising performance in computer vision. Mid-level features learned by incorporating class-level information are potentially more discriminative than traditional low-level local features. In this paper, an effective method is proposed to extract mid-level features from Kinect skeletons for 3D human action recognition. Firstly, the orientations of limbs connected by two skeleton joints are computed and each orientation is encoded into one of the 27 states indicating the spatial relationship of the joints. Secondly, limbs are combined into parts and the limb's states are mapped into part states. Finally, frequent pattern mining is employed to mine the most frequent and relevant (discriminative, representative and non-redundant) states of parts in continuous several frames. These parts are referred to as Frequent Local Parts or FLPs. The FLPs allow us to build powerful bag-of-FLP-based action representation. This new representation yields state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D.

RTAug 2, 2012
A phase-sensitive method for filtering on the sphere

Ramakrishna Kakarala, Philip Ogunbona

This paper examines filtering on a sphere, by first examining the roles of spherical harmonic magnitude and phase. We show that phase is more important than magnitude in determining the structure of a spherical function. We examine the properties of linear phase shifts in the spherical harmonic domain, which suggest a mechanism for constructing finite-impulse-response (FIR) filters. We show that those filters have desirable properties, such as being associative, mapping spherical functions to spherical functions, allowing directional filtering, and being defined by relatively simple equations. We provide examples of the filters for both spherical and manifold data.