Richard Yi Da Xu

LG
h-index8
33papers
216citations
Novelty53%
AI Score43

33 Papers

CVJul 25, 2022Code
Domain Decorrelation with Potential Energy Ranking

Sen Pei, Jiaxi Sun, Richard Yi Da Xu et al.

Machine learning systems, especially the methods based on deep learning, enjoy great success in modern computer vision tasks under experimental settings. Generally, these classic deep learning methods are built on the \emph{i.i.d.} assumption, supposing the training and test data are drawn from a similar distribution independently and identically. However, the aforementioned \emph{i.i.d.} assumption is in general unavailable in the real-world scenario, and as a result, leads to sharp performance decay of deep learning algorithms. Behind this, domain shift is one of the primary factors to be blamed. In order to tackle this problem, we propose using \textbf{Po}tential \textbf{E}nergy \textbf{R}anking (PoER) to decouple the object feature and the domain feature (\emph{i.e.,} appearance feature) in given images, promoting the learning of label-discriminative features while filtering out the irrelevant correlations between the objects and the background. PoER helps the neural networks to capture label-related features which contain the domain information first in shallow layers and then distills the label-discriminative representations out progressively, enforcing the neural networks to be aware of the characteristic of objects and background which is vital to the generation of domain-invariant features. PoER reports superior performance on domain generalization benchmarks, improving the average top-1 accuracy by at least 1.20\% compared to the existing methods. Moreover, we use PoER in the ECCV 2022 NICO Challenge\footnote{https://nicochallenge.com}, achieving top place with only a vanilla ResNet-18. The code has been made available at https://github.com/ForeverPs/PoER.

LGJan 18, 2023
A variational autoencoder-based nonnegative matrix factorisation model for deep dictionary learning

Hong-Bo Xie, Caoyuan Li, Shuliang Wang et al.

Construction of dictionaries using nonnegative matrix factorisation (NMF) has extensive applications in signal processing and machine learning. With the advances in deep learning, training compact and robust dictionaries using deep neural networks, i.e., dictionaries of deep features, has been proposed. In this study, we propose a probabilistic generative model which employs a variational autoencoder (VAE) to perform nonnegative dictionary learning. In contrast to the existing VAE models, we cast the model under a statistical framework with latent variables obeying a Gamma distribution and design a new loss function to guarantee the nonnegative dictionaries. We adopt an acceptance-rejection sampling reparameterization trick to update the latent variables iteratively. We apply the dictionaries learned from VAE-NMF to two signal processing tasks, i.e., enhancement of speech and extraction of muscle synergies. Experimental results demonstrate that VAE-NMF performs better in learning the latent nonnegative dictionaries in comparison with state-of-the-art methods.

CVJan 17, 2023
Free Lunch for Generating Effective Outlier Supervision

Sen Pei, Jiaxi Sun, Richard Yi Da Xu et al.

When deployed in practical applications, computer vision systems will encounter numerous unexpected images (\emph{i.e.}, out-of-distribution data). Due to the potentially raised safety risks, these aforementioned unseen data should be carefully identified and handled. Generally, existing approaches in dealing with out-of-distribution (OOD) detection mainly focus on the statistical difference between the features of OOD and in-distribution (ID) data extracted by the classifiers. Although many of these schemes have brought considerable performance improvements, reducing the false positive rate (FPR) when processing open-set images, they necessarily lack reliable theoretical analysis and generalization guarantees. Unlike the observed ways, in this paper, we investigate the OOD detection problem based on the Bayes rule and present a convincing description of the reason for failures encountered by conventional classifiers. Concretely, our analysis reveals that refining the probability distribution yielded by the vanilla neural networks is necessary for OOD detection, alleviating the issues of assigning high confidence to OOD data. To achieve this effortlessly, we propose an ultra-effective method to generate near-realistic outlier supervision. Extensive experiments on large-scale benchmarks reveal that our proposed \texttt{BayesAug} significantly reduces the FPR95 over 12.50\% compared with the previous schemes, boosting the reliability of machine learning systems. The code will be made publicly available.

CVAug 16, 2024
TsCA: On the Semantic Consistency Alignment via Conditional Transport for Compositional Zero-Shot Learning

Miaoge Li, Jingcai Guo, Richard Yi Da Xu et al.

Compositional Zero-Shot Learning (CZSL) aims to recognize novel state-object compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, effectively calibrating the bias between semantically similar multimodal representations, as well as generalizing pre-trained knowledge to novel compositional contexts, remains an enduring challenge. In this paper, our interest is to revisit the conditional transport (CT) theory and its homology to the visual-semantics interaction in CZSL and further, propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that well-addresses these issues. Concretely, we utilize three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs to minimize their semantic discrepancies. To further ensure the consistency transfer within these sets, we implement a cycle-consistency constraint that refines the learning by guaranteeing the feature consistency of the self-mapping during transport flow, regardless of modality. Moreover, we extend the CT plans to an open-world setting, which enables the model to effectively filter out unfeasible pairs, thereby speeding up the inference as well as increasing the accuracy. Extensive experiments are conducted to verify the effectiveness of the proposed method.

CLSep 26, 2023
KERMIT: Knowledge Graph Completion of Enhanced Relation Modeling with Inverse Transformation

Haotian Li, Bin Yu, Yuliang Wei et al.

Knowledge graph completion (KGC) revolves around populating missing triples in a knowledge graph using available information. Text-based methods, which depend on textual descriptions of triples, often encounter difficulties when these descriptions lack sufficient information for accurate prediction-an issue inherent to the datasets and not easily resolved through modeling alone. To address this and ensure data consistency, we first use large language models (LLMs) to generate coherent descriptions, bridging the semantic gap between queries and answers. Secondly, we utilize inverse relations to create a symmetric graph, thereby providing augmented training samples for KGC. Additionally, we employ the label information inherent in knowledge graphs (KGs) to enhance the existing contrastive framework, making it fully supervised. These efforts have led to significant performance improvements on the WN18RR and FB15k-237 datasets. According to standard evaluation metrics, our approach achieves a 4.2% improvement in Hit@1 on WN18RR and a 3.4% improvement in Hit@3 on FB15k-237, demonstrating superior performance.

43.5LGMay 13
Rethinking Generalization in Graph Neural Networks: A Structural Complexity Perspective

Peiyao Wang, Liang Bai, Xian Yang et al.

Graph neural networks (GNNs) have emerged as a fundamental tool for learning from graph-structured data, achieving strong performance across a wide range of applications. However, understanding their generalization capabilities remains challenging due to the complex structural dependencies inherent in such data. Existing generalization analyses largely follow the classical machine learning paradigm, focusing primarily on model complexity while overlooking the fundamental role of graph structure. Therefore, in this work, we systematically investigate this role by asking: does the graph structure actually influence generalization, and if so, by how much? To answer the first question and validate our intuition, we theoretically prove that incorporating more edges into the prediction process transforms the input representations to be overly accommodating to the output model, thereby inducing overfitting. To address the second question, we formulate a structural complexity measure based on the number of effective edges and derive a Rademacher complexity-based generalization bound. In doing so, we demonstrate that GNN generalization depends explicitly on structural complexity, alongside traditional parameter-dependent factors. Motivated by these theoretical findings, we propose a structural entropy regularization method. This approach controls structural complexity by regulating effective edges to balance underfitting and overfitting, ultimately improving the generalization performance of GNNs.

CLNov 24, 2024
Deep Sparse Latent Feature Models for Knowledge Graph Completion

Haotian Li, Rui Zhang, Lingzhi Wang et al.

Recent advances in knowledge graph completion (KGC) have emphasized text-based approaches to navigate the inherent complexities of large-scale knowledge graphs (KGs). While these methods have achieved notable progress, they frequently struggle to fully incorporate the global structural properties of the graph. Stochastic blockmodels (SBMs), especially the latent feature relational model (LFRM), offer robust probabilistic frameworks for identifying latent community structures and improving link prediction. This paper presents a novel probabilistic KGC framework utilizing sparse latent feature models, optimized via a deep variational autoencoder (VAE). Our proposed method dynamically integrates global clustering information with local textual features to effectively complete missing triples, while also providing enhanced interpretability of the underlying latent structures. Extensive experiments on four benchmark datasets with varying scales demonstrate the significant performance gains achieved by our method.

LGFeb 4, 2022
Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning

Wei Huang, Chunrui Liu, Yilan Chen et al.

PAC-Bayesian is an analysis framework where the training error can be expressed as the weighted average of the hypotheses in the posterior distribution whilst incorporating the prior knowledge. In addition to being a pure generalization bound analysis tool, PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, making them a powerful and relevant framework that can numerically provide a tight generalization bound for supervised learning. For simplicity, we call probabilistic neural network learned using training objectives derived from PAC-Bayesian bounds as {\it PAC-Bayesian learning}. Despite their empirical success, the theoretical analysis of PAC-Bayesian learning for neural networks is rarely explored. This paper proposes a new class of convergence and generalization analysis for PAC-Bayes learning when it is used to train the over-parameterized neural networks by the gradient descent method. For a wide probabilistic neural network, we show that when PAC-Bayes learning is applied, the convergence result corresponds to solving a kernel ridge regression when the probabilistic neural tangent kernel (PNTK) is used as its kernel. Based on this finding, we further characterize the uniform PAC-Bayesian generalization bound which improves over the Rademacher complexity-based bound for non-probabilistic neural network. Finally, drawing the insight from our theoretical results, we propose a proxy measure for efficient hyperparameters selection, which is proven to be time-saving.

LGSep 19, 2021
Regularize! Don't Mix: Multi-Agent Reinforcement Learning without Explicit Centralized Structures

Chapman Siu, Jason Traish, Richard Yi Da Xu

We propose using regularization for Multi-Agent Reinforcement Learning rather than learning explicit cooperative structures called {\em Multi-Agent Regularized Q-learning} (MARQ). Many MARL approaches leverage centralized structures in order to exploit global state information or removing communication constraints when the agents act in a decentralized manner. Instead of learning redundant structures which is removed during agent execution, we propose instead to leverage shared experiences of the agents to regularize the individual policies in order to promote structured exploration. We examine several different approaches to how MARQ can either explicitly or implicitly regularize our policies in a multi-agent setting. MARQ aims to address these limitations in the MARL context through applying regularization constraints which can correct bias in off-policy out-of-distribution agent experiences and promote diverse exploration. Our algorithm is evaluated on several benchmark multi-agent environments and we show that MARQ consistently outperforms several baselines and state-of-the-art algorithms; learning in fewer steps and converging to higher returns.

LGSep 19, 2021
Dual Behavior Regularized Reinforcement Learning

Chapman Siu, Jason Traish, Richard Yi Da Xu

Reinforcement learning has been shown to perform a range of complex tasks through interaction with an environment or collected leveraging experience. However, many of these approaches presume optimal or near optimal experiences or the presence of a consistent environment. In this work we propose dual, advantage-based behavior policy based on counterfactual regret minimization. We demonstrate the flexibility of this approach and how it can be adapted to online contexts where the environment is available to collect experiences and a variety of other contexts. We demonstrate this new algorithm can outperform several strong baseline models in different contexts based on a range of continuous environments. Additional ablations provide insights into how our dual behavior regularized reinforcement learning approach is designed compared with other plausible modifications and demonstrates its ability to generalize.

LGSep 19, 2021
Greedy UnMixing for Q-Learning in Multi-Agent Reinforcement Learning

Chapman Siu, Jason Traish, Richard Yi Da Xu

This paper introduces Greedy UnMix (GUM) for cooperative multi-agent reinforcement learning (MARL). Greedy UnMix aims to avoid scenarios where MARL methods fail due to overestimation of values as part of the large joint state-action space. It aims to address this through a conservative Q-learning approach through restricting the state-marginal in the dataset to avoid unobserved joint state action spaces, whilst concurrently attempting to unmix or simplify the problem space under the centralized training with decentralized execution paradigm. We demonstrate the adherence to Q-function lower bounds in the Q-learning for MARL scenarios, and demonstrate superior performance to existing Q-learning MARL approaches as well as more general MARL algorithms over a set of benchmark MARL tasks, despite its relative simplicity compared with state-of-the-art approaches.

CVAug 5, 2021
Alleviating Mode Collapse in GAN via Diversity Penalty Module

Sen Pei, Richard Yi Da Xu, Shiming Xiang et al.

The vanilla GAN (Goodfellow et al. 2014) suffers from mode collapse deeply, which usually manifests as that the images generated by generators tend to have a high similarity amongst them, even though their corresponding latent vectors have been very different. In this paper, we introduce a pluggable diversity penalty module (DPM) to alleviate mode collapse of GANs. It reduces the similarity of image pairs in feature space, i.e., if two latent vectors are different, then we enforce the generator to generate two images with different features. The normalized Gram matrix is used to measure the similarity. We compare the proposed method with Unrolled GAN (Metz et al. 2016), BourGAN (Xiao, Zhong, and Zheng 2018), PacGAN (Lin et al. 2018), VEEGAN (Srivastava et al. 2017) and ALI (Dumoulin et al. 2016) on 2D synthetic dataset, and results show that the diversity penalty module can help GAN capture much more modes of the data distribution. Further, in classification tasks, we apply this method as image data augmentation on MNIST, Fashion- MNIST and CIFAR-10, and the classification testing accuracy is improved by 0.24%, 1.34% and 0.52% compared with WGAN GP (Gulrajani et al. 2017), respectively. In domain translation, diversity penalty module can help StarGAN (Choi et al. 2018) generate more accurate attention masks and accelarate the convergence process. Finally, we quantitatively evaluate the proposed method with IS and FID on CelebA, CIFAR-10, MNIST and Fashion-MNIST, and the results suggest GAN with diversity penalty module gets much higher IS and lower FID compared with some SOTA GAN architectures.

LGMar 3, 2021
Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective

Wei Huang, Yayong Li, Weitao Du et al.

Graph convolutional networks (GCNs) and their variants have achieved great success in dealing with graph-structured data. Nevertheless, it is well known that deep GCNs suffer from the over-smoothing problem, where node representations tend to be indistinguishable as more layers are stacked up. The theoretical research to date on deep GCNs has focused primarily on expressive power rather than trainability, an optimization perspective. Compared to expressivity, trainability attempts to address a more fundamental question: Given a sufficiently expressive space of models, can we successfully find a good solution via gradient descent-based optimizers? This work fills this gap by exploiting the Graph Neural Tangent Kernel (GNTK), which governs the optimization trajectory under gradient descent for wide GCNs. We formulate the asymptotic behaviors of GNTK in the large depth, which enables us to reveal the dropping trainability of wide and deep GCNs at an exponential rate in the optimization process. Additionally, we extend our theoretical framework to analyze residual connection-based techniques, which are found to be merely able to mitigate the exponential decay of trainability mildly. Inspired by our theoretical insights on trainability, we propose Critical DropEdge, a connectivity-aware and graph-adaptive sampling method, to alleviate the exponential decay problem more fundamentally. Experimental evaluation consistently confirms using our proposed method can achieve better results compared to relevant counterparts with both infinite-width and finite-width.

CVJan 12, 2021
Resolution-invariant Person ReID Based on Feature Transformation and Self-weighted Attention

Ziyue Zhang, Shuai Jiang, Congzhentao Huang et al.

Person Re-identification (ReID) is a critical computer vision task which aims to match the same person in images or video sequences. Most current works focus on settings where the resolution of images is kept the same. However, the resolution is a crucial factor in person ReID, especially when the cameras are at different distances from the person or the camera's models are different from each other. In this paper, we propose a novel two-stream network with a lightweight resolution association ReID feature transformation (RAFT) module and a self-weighted attention (SWA) ReID module to evaluate features under different resolutions. RAFT transforms the low resolution features to corresponding high resolution features. SWA evaluates both features to get weight factors for the person ReID. Both modules are jointly trained to get a resolution-invariant representation. Extensive experiments on five benchmark datasets show the effectiveness of our method. For instance, we achieve Rank-1 accuracy of 43.3% and 83.2% on CAVIAR and MLR-CUHK03, outperforming the state-of-the-art.

LGNov 25, 2020
Implicit bias of deep linear networks in the large learning rate phase

Wei Huang, Weitao Du, Richard Yi Da Xu et al.

Most theoretical studies explaining the regularization effect in deep learning have only focused on gradient descent with a sufficient small learning rate or even gradient flow (infinitesimal learning rate). Such researches, however, have neglected a reasonably large learning rate applied in most practical applications. In this work, we characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in the large learning rate regime, inspired by the seminal work by Lewkowycz et al. [26] in a regression setting with squared loss. They found a learning rate regime with a large stepsize named the catapult phase, where the loss grows at the early stage of training and eventually converges to a minimum that is flatter than those found in the small learning rate regime. We claim that depending on the separation conditions of data, the gradient descent iterates will converge to a flatter minimum in the catapult phase. We rigorously prove this claim under the assumption of degenerate data by overcoming the difficulty of the non-constant Hessian of logistic loss and further characterize the behavior of loss and Hessian for non-separable data. Finally, we demonstrate that flatter minima in the space spanned by non-separable data along with the learning rate in the catapult phase can lead to better generalization empirically.

CVJul 15, 2020
RGB-IR Cross-modality Person ReID based on Teacher-Student GAN Model

Ziyue Zhang, Shuai Jiang, Congzhentao Huang et al.

RGB-Infrared (RGB-IR) person re-identification (ReID) is a technology where the system can automatically identify the same person appearing at different parts of a video when light is unavailable. The critical challenge of this task is the cross-modality gap of features under different modalities. To solve this challenge, we proposed a Teacher-Student GAN model (TS-GAN) to adopt different domains and guide the ReID backbone to learn better ReID information. (1) In order to get corresponding RGB-IR image pairs, the RGB-IR Generative Adversarial Network (GAN) was used to generate IR images. (2) To kick-start the training of identities, a ReID Teacher module was trained under IR modality person images, which is then used to guide its Student counterpart in training. (3) Likewise, to better adapt different domain features and enhance model ReID performance, three Teacher-Student loss functions were used. Unlike other GAN based models, the proposed model only needs the backbone module at the test stage, making it more efficient and resource-saving. To showcase our model's capability, we did extensive experiments on the newly-released SYSU-MM01 RGB-IR Re-ID benchmark and achieved superior performance to the state-of-the-art with 49.8% Rank-1 and 47.4% mAP.

LGApr 13, 2020
On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

Wei Huang, Weitao Du, Richard Yi Da Xu

The prevailing thinking is that orthogonal weights are crucial to enforcing dynamical isometry and speeding up training. The increase in learning speed that results from orthogonal initialization in linear networks has been well-proven. However, while the same is believed to also hold for nonlinear networks when the dynamical isometry condition is satisfied, the training dynamics behind this contention have not been thoroughly explored. In this work, we study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs) with orthogonal initialization via neural tangent kernel (NTK). Through a series of propositions and lemmas, we prove that two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite. Further, during training, the NTK of an orthogonally-initialized infinite-width network should theoretically remain constant. This suggests that the orthogonal initialization cannot speed up training in the NTK (lazy training) regime, contrary to the prevailing thoughts. In order to explore under what circumstances can orthogonality accelerate training, we conduct a thorough empirical investigation outside the NTK regime. We find that when the hyper-parameters are set to achieve a linear regime in nonlinear activation, orthogonal initialization can improve the learning speed with a large learning rate or large depth.

LGDec 19, 2019
Gaussian Process Latent Variable Model Factorization for Context-aware Recommender Systems

Wei Huang, Richard Yi Da Xu

Context-aware recommender systems (CARS) have gained increasing attention due to their ability to utilize contextual information. Compared to traditional recommender systems, CARS are, in general, able to generate more accurate recommendations. Latent factors approach accounts for a large proportion of CARS. Recently, a non-linear Gaussian Process (GP) based factorization method was proven to outperform the state-of-the-art methods in CARS. Despite its effectiveness, GP model-based methods can suffer from over-fitting and may not be able to determine the impact of each context automatically. In order to address such shortcomings, we propose a Gaussian Process Latent Variable Model Factorization (GPLVMF) method, where we apply an appropriate prior to the original GP model. Our work is primarily inspired by the Gaussian Process Latent Variable Model (GPLVM), which was a non-linear dimensionality reduction method. As a result, we improve the performance on the real datasets significantly as well as capturing the importance of each context. In addition to the general advantages, our method provides two main contributions regarding recommender system settings: (1) addressing the influence of bias by setting a non-zero mean function, and (2) utilizing real-valued contexts by fixing the latent space with real values.

LGDec 19, 2019
Mean field theory for deep dropout networks: digging up gradient backpropagation deeply

Wei Huang, Richard Yi Da Xu, Weitao Du et al.

In recent years, the mean field theory has been applied to the study of neural networks and has achieved a great deal of success. The theory has been applied to various neural network structures, including CNNs, RNNs, Residual networks, and Batch normalization. Inevitably, recent work has also covered the use of dropout. The mean field theory shows that the existence of depth scales that limit the maximum depth of signal propagation and gradient backpropagation. However, the gradient backpropagation is derived under the gradient independence assumption that weights used during feed forward are drawn independently from the ones used in backpropagation. This is not how neural networks are trained in a real setting. Instead, the same weights used in a feed-forward step needs to be carried over to its corresponding backpropagation. Using this realistic condition, we perform theoretical computation on linear dropout networks and a series of experiments on dropout networks. Our empirical results show an interesting phenomenon that the length gradients can backpropagate for a single input and a pair of inputs are governed by the same depth scale. Besides, we study the relationship between variance and mean of statistical metrics of the gradient and shown an emergence of universality. Finally, we investigate the maximum trainable length for deep dropout networks through a series of experiments using MNIST and CIFAR10 and provide a more precise empirical formula that describes the trainable length than original work.

LGJul 15, 2018
Magnitude Bounded Matrix Factorisation for Recommender Systems

Shuai Jiang, Kan Li, Richard Yi Da Xu

Low rank matrix factorisation is often used in recommender systems as a way of extracting latent features. When dealing with large and sparse datasets, traditional recommendation algorithms face the problem of acquiring large, unrestrained, fluctuating values over predictions especially for users/items with very few corresponding observations. Although the problem has been somewhat solved by imposing bounding constraints over its objectives, and/or over all entries to be within a fixed range, in terms of gaining better recommendations, these approaches have two major shortcomings that we aim to mitigate in this work: one is they can only deal with one pair of fixed bounds for all entries, and the other one is they are very time-consuming when applied on large scale recommender systems. In this paper, we propose a novel algorithm named Magnitude Bounded Matrix Factorisation (MBMF), which allows different bounds for individual users/items and performs very fast on large scale datasets. The key idea of our algorithm is to construct a model by constraining the magnitudes of each individual user/item feature vector. We achieve this by converting from the Cartesian to Spherical coordinate system with radii set as the corresponding magnitudes, which allows the above constrained optimisation problem to become an unconstrained one. The Stochastic Gradient Descent (SGD) method is then applied to solve the unconstrained task efficiently. Experiments on synthetic and real datasets demonstrate that in most cases the proposed MBMF is superior over all existing algorithms in terms of accuracy and time complexity.

MLJun 12, 2018
Diverse Online Feature Selection

Chapman Siu, Richard Yi Da Xu

Online feature selection has been an active research area in recent years. We propose a novel diverse online feature selection method based on Determinantal Point Processes (DPP). Our model aims to provide diverse features which can be composed in either a supervised or unsupervised framework. The framework aims to promote diversity based on the kernel produced on a feature level, through at most three stages: feature sampling, local criteria and global criteria for feature selection. In the feature sampling, we sample incoming stream of features using conditional DPP. The local criteria is used to assess and select streamed features (i.e. only when they arrive), we use unsupervised scale invariant methods to remove redundant features and optionally supervised methods to introduce label information to assess relevant features. Lastly, the global criteria uses regularization methods to select a global optimal subset of features. This three stage procedure continues until there are no more features arriving or some predefined stopping condition is met. We demonstrate based on experiments conducted on that this approach yields better compactness, is comparable and in some instances outperforms other state-of-the-art online feature selection methods.

LGJul 18, 2017
Cooperative Hierarchical Dirichlet Processes: Superposition vs. Maximization

Junyu Xuan, Jie Lu, Guangquan Zhang et al.

The cooperative hierarchical structure is a common and significant data structure observed in, or adopted by, many research areas, such as: text mining (author-paper-word) and multi-label classification (label-instance-feature). Renowned Bayesian approaches for cooperative hierarchical structure modeling are mostly based on topic models. However, these approaches suffer from a serious issue in that the number of hidden topics/factors needs to be fixed in advance and an inappropriate number may lead to overfitting or underfitting. One elegant way to resolve this issue is Bayesian nonparametric learning, but existing work in this area still cannot be applied to cooperative hierarchical structure modeling. In this paper, we propose a cooperative hierarchical Dirichlet process (CHDP) to fill this gap. Each node in a cooperative hierarchical structure is assigned a Dirichlet process to model its weights on the infinite hidden factors/topics. Together with measure inheritance from hierarchical Dirichlet process, two kinds of measure cooperation, i.e., superposition and maximization, are defined to capture the many-to-many relationships in the cooperative hierarchical structure. Furthermore, two constructive representations for CHDP, i.e., stick-breaking and international restaurant process, are designed to facilitate the model inference. Experiments on synthetic and real-world data with cooperative hierarchical structures demonstrate the properties and the ability of CHDP for cooperative hierarchical structure modeling and its potential for practical application scenarios.

MLJun 27, 2016
The Dependent Random Measures with Independent Increments in Mixture Models

Cheng Luo, Richard Yi Da Xu, Yang Xiang

When observations are organized into groups where commonalties exist amongst them, the dependent random measures can be an ideal choice for modeling. One of the propositions of the dependent random measures is that the atoms of the posterior distribution are shared amongst groups, and hence groups can borrow information from each other. When normalized dependent random measures prior with independent increments are applied, we can derive appropriate exchangeable probability partition function (EPPF), and subsequently also deduce its inference algorithm given any mixture model likelihood. We provide all necessary derivation and solution to this framework. For demonstration, we used mixture of Gaussians likelihood in combination with a dependent structure constructed by linear combinations of CRMs. Our experiments show superior performance when using this framework, where the inferred values including the mixing weights and the number of clusters both respond appropriately to the number of completely random measure used.

MLApr 16, 2016
Smoothed Hierarchical Dirichlet Process: A Non-Parametric Approach to Constraint Measures

Cheng Luo, Yang Xiang, Richard Yi Da Xu

Time-varying mixture densities occur in many scenarios, for example, the distributions of keywords that appear in publications may evolve from year to year, video frame features associated with multiple targets may evolve in a sequence. Any models that realistically cater to this phenomenon must exhibit two important properties: the underlying mixture densities must have an unknown number of mixtures, and there must be some "smoothness" constraints in place for the adjacent mixture densities. The traditional Hierarchical Dirichlet Process (HDP) may be suited to the first property, but certainly not the second. This is due to how each random measure in the lower hierarchies is sampled independent of each other and hence does not facilitate any temporal correlations. To overcome such shortcomings, we proposed a new Smoothed Hierarchical Dirichlet Process (sHDP). The key novelty of this model is that we place a temporal constraint amongst the nearby discrete measures $\{G_j\}$ in the form of symmetric Kullback-Leibler (KL) Divergence with a fixed bound $B$. Although the constraint we place only involves a single scalar value, it nonetheless allows for flexibility in the corresponding successive measures. Remarkably, it also led us to infer the model within the stick-breaking process where the traditional Beta distribution used in stick-breaking is now replaced by a new constraint calculated from $B$. We present the inference algorithm and elaborate on its solutions. Our experiment using NIPS keywords has shown the desirable effect of the model.

MLFeb 9, 2016
Bayesian nonparametric image segmentation using a generalized Swendsen-Wang algorithm

Richard Yi Da Xu, Francois Caron, Arnaud Doucet

Unsupervised image segmentation aims at clustering the set of pixels of an image into spatially homogeneous regions. We introduce here a class of Bayesian nonparametric models to address this problem. These models are based on a combination of a Potts-like spatial smoothness component and a prior on partitions which is used to control both the number and size of clusters. This class of models is flexible enough to include the standard Potts model and the more recent Potts-Dirichlet Process model \cite{Orbanz2008}. More importantly, any prior on partitions can be introduced to control the global clustering structure so that it is possible to penalize small or large clusters if necessary. Bayesian computation is carried out using an original generalized Swendsen-Wang algorithm. Experiments demonstrate that our method is competitive in terms of RAND\ index compared to popular image segmentation methods, such as mean-shift, and recent alternative Bayesian nonparametric models.

MLJul 12, 2015
Dependent Indian Buffet Process-based Sparse Nonparametric Nonnegative Matrix Factorization

Junyu Xuan, Jie Lu, Guangquan Zhang et al.

Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two optimized nonnegative matrices appropriate for the intended applications. The method has been widely used for unsupervised learning tasks, including recommender systems (rating matrix of users by items) and document clustering (weighting matrix of papers by keywords). However, traditional NMF methods typically assume the number of latent factors (i.e., dimensionality of the loading matrices) to be fixed. This assumption makes them inflexible for many applications. In this paper, we propose a nonparametric NMF framework to mitigate this issue by using dependent Indian Buffet Processes (dIBP). In a nutshell, we apply a correlation function for the generation of two stick weights associated with each pair of columns of loading matrices, while still maintaining their respective marginal distribution specified by IBP. As a consequence, the generation of two loading matrices will be column-wise (indirectly) correlated. Under this same framework, two classes of correlation function are proposed (1) using Bivariate beta distribution and (2) using Copula function. Both methods allow us to adopt our work for various applications by flexibly choosing an appropriate parameter settings. Compared with the other state-of-the art approaches in this area, such as using Gaussian Process (GP)-based dIBP, our work is seen to be much more flexible in terms of allowing the two corresponding binary matrix columns to have greater variations in their non-zero entries. Our experiments on the real-world and synthetic datasets show that three proposed models perform well on the document clustering task comparing standard NMF without predefining the dimension for the factor matrices, and the Bivariate beta distribution-based and Copula-based models have better flexibility than the GP-based model.

MLMar 30, 2015
Nonparametric Relational Topic Models through Dependent Gamma Processes

Junyu Xuan, Jie Lu, Guangquan Zhang et al.

Traditional Relational Topic Models provide a way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensional reduction, document clustering, link prediction, benefit from this revealed knowledge. However, existing relational topic models are based on an assumption that the number of hidden topics is known in advance, and this is impractical in many real-world applications. Therefore, in order to relax this assumption, we propose a nonparametric relational topic model in this paper. Instead of using fixed-dimensional probability distributions in its generative model, we use stochastic processes. Specifically, a gamma process is assigned to each document, which represents the topic interest of this document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics are shared by all the documents. In order to resolve these challenges, we use a subsampling strategy to assign each document a different gamma process from the global gamma process, and the subsampling probabilities of documents are assigned with a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and its number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capabilities of learning the hidden topics and, more importantly, the number of topics.

MLMar 30, 2015
Infinite Author Topic Model based on Mixed Gamma-Negative Binomial Process

Junyu Xuan, Jie Lu, Guangquan Zhang et al.

Incorporating the side information of text corpus, i.e., authors, time stamps, and emotional tags, into the traditional text mining models has gained significant interests in the area of information retrieval, statistical natural language processing, and machine learning. One branch of these works is the so-called Author Topic Model (ATM), which incorporates the authors's interests as side information into the classical topic model. However, the existing ATM needs to predefine the number of topics, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability on fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. To be specific, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for the multi-author's contribution towards this document. An efficient Gibbs sampling inference algorithm with each conditional distribution being closed-form is developed for the IAT model. Experiments on several real-world datasets show the capabilities of our IAT model to learn the hidden topics, authors' interests on these topics and the number of topics simultaneously.

MLMar 10, 2015
An Adaptive Online HDP-HMM for Segmentation and Classification of Sequential Data

Ava Bargi, Richard Yi Da Xu, Massimo Piccardi

In the recent years, the desire and need to understand sequential data has been increasing, with particular interest in sequential contexts such as patient monitoring, understanding daily activities, video surveillance, stock market and the like. Along with the constant flow of data, it is critical to classify and segment the observations on-the-fly, without being limited to a rigid number of classes. In addition, the model needs to be capable of updating its parameters to comply with possible evolutions. This interesting problem, however, is not adequately addressed in the literature since many studies focus on offline classification over a pre-defined class set. In this paper, we propose a principled solution to this gap by introducing an adaptive online system based on Markov switching models with hierarchical Dirichlet process priors. This infinite adaptive online approach is capable of segmenting and classifying the sequential data over unlimited number of classes, while meeting the memory and delay constraints of streaming contexts. The model is further enhanced by introducing a learning rate, responsible for balancing the extent to which the model sustains its previous learning (parameters) or adapts to the new streaming observations. Experimental results on several variants of stationary and evolving synthetic data and two video datasets, TUM Assistive Kitchen and collatedWeizmann, show remarkable performance in segmentation and classification, particularly for evolutionary sequences with changing distributions and/or containing new, unseen classes.

LGOct 6, 2013
Learning Hidden Structures with Relational Models by Adequately Involving Rich Information in A Network

Xuhui Fan, Richard Yi Da Xu, Longbing Cao et al.

Effectively modelling hidden structures in a network is very practical but theoretically challenging. Existing relational models only involve very limited information, namely the binary directional link data, embedded in a network to learn hidden networking structures. There is other rich and meaningful information (e.g., various attributes of entities and more granular information than binary elements such as "like" or "dislike") missed, which play a critical role in forming and understanding relations in a network. In this work, we propose an informative relational model (InfRM) framework to adequately involve rich information and its granularity in a network, including metadata information about each entity and various forms of link data. Firstly, an effective metadata information incorporation method is employed on the prior information from relational models MMSB and LFRM. This is to encourage the entities with similar metadata information to have similar hidden structures. Secondly, we propose various solutions to cater for alternative forms of link data. Substantial efforts have been made towards modelling appropriateness and efficiency, for example, using conjugate priors. We evaluate our framework and its inference algorithms in different datasets, which shows the generality and effectiveness of our models in capturing implicit structures in networks.

MLJul 2, 2013
A non-parametric conditional factor regression model for high-dimensional input and response

Ava Bargi, Richard Yi Da Xu, Massimo Piccardi

In this paper, we propose a non-parametric conditional factor regression (NCFR)model for domains with high-dimensional input and response. NCFR enhances linear regression in two ways: a) introducing low-dimensional latent factors leading to dimensionality reduction and b) integrating an Indian Buffet Process as a prior for the latent factors to derive unlimited sparse dimensions. Experimental results comparing NCRF to several alternatives give evidence to remarkable prediction performance.

SIJun 13, 2013
Dynamic Infinite Mixed-Membership Stochastic Blockmodel

Xuhui Fan, Longbing Cao, Richard Yi Da Xu

Directional and pairwise measurements are often used to model inter-relationships in a social network setting. The Mixed-Membership Stochastic Blockmodel (MMSB) was a seminal work in this area, and many of its capabilities were extended since then. In this paper, we propose the \emph{Dynamic Infinite Mixed-Membership stochastic blockModel (DIM3)}, a generalised framework that extends the existing work to a potentially infinite number of communities and mixture memberships for each of the network's nodes. This model is in a dynamic setting, where additional model parameters are introduced to reflect the degree of persistence between one's memberships at consecutive times. Accordingly, two effective posterior sampling strategies and their results are presented using both synthetic and real data.

LGJun 12, 2013
Copula Mixed-Membership Stochastic Blockmodel for Intra-Subgroup Correlations

Xuhui Fan, Longbing Cao, Richard Yi Da Xu

The \emph{Mixed-Membership Stochastic Blockmodel (MMSB)} is a popular framework for modeling social network relationships. It can fully exploit each individual node's participation (or membership) in a social structure. Despite its powerful representations, this model makes an assumption that the distributions of relational membership indicators between two nodes are independent. Under many social network settings, however, it is possible that certain known subgroups of people may have high or low correlations in terms of their membership categories towards each other, and such prior information should be incorporated into the model. To this end, we introduce a \emph{Copula Mixed-Membership Stochastic Blockmodel (cMMSB)} where an individual Copula function is employed to jointly model the membership pairs of those nodes within the subgroup of interest. The model enables the use of various Copula functions to suit the scenario, while maintaining the membership's marginal distribution, as needed, for modeling membership indicators with other nodes outside of the subgroup of interest. We describe the proposed model and its inference algorithm in detail for both the finite and infinite cases. In the experiment section, we compare our algorithms with other popular models in terms of link prediction, using both synthetic and real world data.