LG Jan 6, 2021
Representation learning for maximization of MI, nonlinear ICA and nonlinear subspaces with robust density ratio estimation
Hiroaki Sasaki, Takashi Takenouchi
Contrastive learning is a recent promising approach in unsupervised representation learning where a feature representation of data is learned by solving a pseudo classification problem from unlabelled data. However, it is not straightforward to understand what representation contrastive learning yields. In addition, contrastive learning is often based on maximum likelihood estimation, which tends to be vulnerable to contamination by outliers. To promote the understanding of contrastive learning, this paper first theoretically shows a connection to maximization of mutual information (MI). Our result indicates that density ratio estimation is necessary and sufficient for maximization of MI under some conditions. Thus, contrastive learning related to density ratio estimation, as done in popular objective functions, can be interpreted as maximizing MI. Next, with the density ratio, we establish new recovery conditions for the latent source components in nonlinear independent component analysis (ICA). In contrast with existing work, the established conditions include a novel insight into the dimensionality of data, which is clearly supported by numerical experiments. Furthermore, inspired by nonlinear ICA, we propose a novel framework to estimate a nonlinear subspace for lower-dimensional latent source components, and some theoretical conditions for the subspace estimation are established with the density ratio. Then, we propose a practical method through outlier-robust density ratio estimation, which can be seen as performing maximization of MI, nonlinear ICA or nonlinear subspace estimation. Moreover, a sample-efficient nonlinear ICA method is also proposed. We theoretically investigate outlier-robustness of the proposed methods. Finally, the usefulness of the proposed methods is numerically demonstrated in nonlinear ICA and through application to linear classification.
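For orientation, the link between MI and the density ratio can be read off the textbook definition of mutual information. The notation below (random variables $X$, $Y$ and ratio $r$) is assumed for illustration and is not taken from the paper:

```latex
I(X; Y)
  = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \,\mathrm{d}x \,\mathrm{d}y
  = \mathbb{E}_{p(x, y)}\!\left[ \log r(x, y) \right],
\qquad
r(x, y) = \frac{p(x, y)}{p(x)\, p(y)} .
```

MI is thus the expected log of the ratio between the joint density and the product of marginals, which is why an estimate of $r$ is closely tied to an estimate of MI.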
LG Nov 1, 2019
Robust contrastive learning and nonlinear ICA in the presence of outliers
Hiroaki Sasaki, Takashi Takenouchi, Ricardo Monti et al.
Nonlinear independent component analysis (ICA) is a general framework for unsupervised representation learning, and aimed at recovering the latent variables in data. Recent practical methods perform nonlinear ICA by solving a series of classification problems based on logistic regression. However, it is well-known that logistic regression is vulnerable to outliers, and thus the performance can be strongly weakened by outliers. In this paper, we first theoretically analyze nonlinear ICA models in the presence of outliers. Our analysis implies that estimation in nonlinear ICA can be seriously hampered when outliers exist on the tails of the (noncontaminated) target density, which happens in a typical case of contamination by outliers. We develop two robust nonlinear ICA methods based on the γ-divergence, which is a robust alternative to the KL-divergence in logistic regression. The proposed methods are shown to have desired robustness properties in the context of nonlinear ICA. We also experimentally demonstrate that the proposed methods are very robust and outperform existing methods in the presence of outliers. Finally, the proposed method is applied to ICA-based causal discovery and shown to find a plausible causal relationship on fMRI data.
ML Oct 18, 2019
Robust modal regression with direct log-density derivative estimation
Hiroaki Sasaki, Tomoya Sakai, Takafumi Kanamori
Modal regression is aimed at estimating the global mode (i.e., global maximum) of the conditional density function of the output variable given input variables, and has led to regression methods robust against heavy-tailed or skewed noises. The conditional mode is often estimated through maximization of the modal regression risk (MRR). In order to apply a gradient method for the maximization, the fundamental challenge is accurate approximation of the gradient of MRR, not MRR itself. To overcome this challenge, in this paper, we take a novel approach of directly approximating the gradient of MRR. To approximate the gradient, we develop kernelized and neural-network-based versions of the least-squares log-density derivative estimator, which directly approximates the derivative of the log-density without density estimation. With direct approximation of the MRR gradient, we first propose a modal regression method with kernels, and derive a new parameter update rule based on a fixed-point method. Then, the derived update rule is theoretically proved to have a monotonic hill-climbing property towards the conditional mode. Furthermore, we indicate that our approach of directly approximating the gradient is compatible with recent sophisticated stochastic gradient methods (e.g., Adam), and then propose another modal regression method based on neural networks. Finally, the superior performance of the proposed methods is demonstrated on various artificial and benchmark datasets.
ML Jun 5, 2018
Neural-Kernelized Conditional Density Estimation
Hiroaki Sasaki, Aapo Hyvärinen
Conditional density estimation is a general framework for solving various problems in machine learning. Among existing methods, non-parametric and/or kernel-based methods are often difficult to use on large datasets, while methods based on neural networks usually make restrictive parametric assumptions on the probability densities. Here, we propose a novel method for estimating the conditional density based on score matching. In contrast to existing methods, we employ scalable neural networks, but do not make explicit parametric assumptions on densities. The key challenge in applying score matching to neural networks is computation of the first- and second-order derivatives of a model for the log-density. We tackle this challenge by developing a new neural-kernelized approach, which can be applied on large datasets with stochastic gradient descent, while the reproducing kernels allow for easy computation of the derivatives needed in score matching. We show that the neural-kernelized function approximator has universal approximation capability and that our method is consistent in conditional density estimation. We numerically demonstrate that our method is useful in high-dimensional conditional density estimation, and compares favourably with existing methods. Finally, we prove that the proposed method has interesting connections to two probabilistically principled frameworks of representation learning: Nonlinear sufficient dimension reduction and nonlinear independent component analysis.
ML May 22, 2018
Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning
Aapo Hyvarinen, Hiroaki Sasaki, Richard E. Turner
Nonlinear ICA is a fundamental problem for unsupervised representation learning, emphasizing the capacity to recover the underlying latent variables generating the data (i.e., identifiability). Recently, the very first identifiability proofs for nonlinear ICA have been proposed, leveraging the temporal structure of the independent components. Here, we propose a general framework for nonlinear ICA, which, as a special case, can make use of temporal structure. It is based on augmenting the data by an auxiliary variable, such as the time index, the history of the time series, or any other available information. We propose to learn nonlinear ICA by discriminating between true augmented data, or data in which the auxiliary variable has been randomized. This enables the framework to be implemented algorithmically through logistic regression, possibly in a neural network. We provide a comprehensive proof of the identifiability of the model as well as the consistency of our estimation method. The approach not only provides a general theoretical framework combining and generalizing previously proposed nonlinear ICA models and algorithms, but also brings practical advantages.
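The discrimination task described in this abstract can be sketched as a data-construction step. This is a minimal illustration, not the authors' implementation: the function name `make_contrastive_dataset`, the array shapes, and the use of a random permutation to randomize the auxiliary variable are all assumptions made here.

```python
import numpy as np


def make_contrastive_dataset(x, u, seed=0):
    """Build a binary classification set from data x and auxiliary variable u.

    x: array of shape (n, d) holding the observed data.
    u: array of shape (n, k) holding the auxiliary variable (e.g. a time index).
    Returns (features, labels), where label 1 marks true pairs (x_i, u_i)
    and label 0 marks pairs (x_i, u_j) with u randomly permuted.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Randomize the auxiliary variable by shuffling it across samples.
    u_shuffled = u[rng.permutation(n)]
    real = np.hstack([x, u])           # true augmented data
    fake = np.hstack([x, u_shuffled])  # auxiliary variable randomized
    features = np.vstack([real, fake])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    return features, labels
```

A classifier trained on `features` and `labels` with a logistic loss, for example a neural network with a sigmoid output, would then play the role of the logistic regression mentioned in the abstract.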
ML Jul 6, 2017
Mode-Seeking Clustering and Density Ridge Estimation via Direct Estimation of Density-Derivative-Ratios
Hiroaki Sasaki, Takafumi Kanamori, Aapo Hyvärinen et al.
Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge both in mode-seeking clustering and density ridge estimation is accurate estimation of the ratios of the first- and second-order density derivatives to the density. A naive approach takes three steps: first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator, and division by the estimated density could significantly magnify the estimation error. To cope with these problems, we propose a novel estimator for the density-derivative-ratios. The proposed estimator does not involve density estimation, but rather directly approximates the ratios of density derivatives of any order. Moreover, we establish a convergence rate of the proposed estimator. Based on the proposed estimator, novel methods both for mode-seeking clustering and density ridge estimation are developed, and the respective convergence rates to the mode and ridge of the underlying density are also established. Finally, we experimentally demonstrate that the developed methods significantly outperform existing methods, particularly for relatively high-dimensional data.
ML Mar 3, 2016
Whitening-Free Least-Squares Non-Gaussian Component Analysis
Hiroaki Shiino, Hiroaki Sasaki, Gang Niu et al.
Non-Gaussian component analysis (NGCA) is an unsupervised linear dimension reduction method that extracts low-dimensional non-Gaussian "signals" from high-dimensional data contaminated with Gaussian noise. NGCA can be regarded as a generalization of projection pursuit (PP) and independent component analysis (ICA) to multi-dimensional and dependent non-Gaussian components. Indeed, seminal approaches to NGCA are based on PP and ICA. Recently, a novel NGCA approach called least-squares NGCA (LSNGCA) has been developed, which gives a solution analytically through least-squares estimation of log-density gradients and eigendecomposition. However, since pre-whitening of data is involved in LSNGCA, it performs unreliably when the data covariance matrix is ill-conditioned, which is often the case in high-dimensional data analysis. In this paper, we propose a whitening-free LSNGCA method and experimentally demonstrate its superiority.
ML Jan 28, 2016
Non-Gaussian Component Analysis with Log-Density Gradient Estimation
Hiroaki Sasaki, Gang Niu, Masashi Sugiyama
Non-Gaussian component analysis (NGCA) is aimed at identifying a linear subspace such that the projected data follows a non-Gaussian distribution. In this paper, we propose a novel NGCA algorithm based on log-density gradient estimation. Unlike existing methods, the proposed NGCA algorithm identifies the linear subspace by using the eigenvalue decomposition without any iterative procedures, and thus is computationally reasonable. Furthermore, through theoretical analysis, we prove that the identified subspace converges to the true subspace at the optimal parametric rate. Finally, the practical performance of the proposed algorithm is demonstrated on both artificial and benchmark datasets.
ML Aug 5, 2015
Direct Estimation of the Derivative of Quadratic Mutual Information with Application in Supervised Dimension Reduction
Voot Tangkaratt, Hiroaki Sasaki, Masashi Sugiyama
A typical goal of supervised dimension reduction is to find a low-dimensional subspace of the input space such that the projected input variables preserve maximal information about the output variables. The dependence maximization approach solves the supervised dimension reduction problem through maximizing a statistical dependence between projected input variables and output variables. A well-known statistical dependence measure is mutual information (MI) which is based on the Kullback-Leibler (KL) divergence. However, it is known that the KL divergence is sensitive to outliers. On the other hand, quadratic MI (QMI) is a variant of MI based on the $L_2$ distance which is more robust against outliers than the KL divergence, and a computationally efficient method to estimate QMI from data, called least-squares QMI (LSQMI), has been proposed recently. For these reasons, developing a supervised dimension reduction method based on LSQMI seems promising. However, not QMI itself, but the derivative of QMI is needed for subspace search in supervised dimension reduction, and the derivative of an accurate QMI estimator is not necessarily a good estimator of the derivative of QMI. In this paper, we propose to directly estimate the derivative of QMI without estimating QMI itself. We show that the direct estimation of the derivative of QMI is more accurate than the derivative of the estimated QMI. Finally, we develop a supervised dimension reduction algorithm which efficiently uses the proposed derivative estimator, and demonstrate through experiments that the proposed method is more robust against outliers than existing methods.
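To make the contrast between the two dependence measures concrete, they can be written side by side. The formulas below use the common textbook forms; the exact normalization used in the paper is not stated in the abstract, so constant factors here are an assumption:

```latex
\mathrm{MI}(X, Y)
  = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \,\mathrm{d}x \,\mathrm{d}y ,
\qquad
\mathrm{QMI}(X, Y)
  = \iint \bigl( p(x, y) - p(x)\, p(y) \bigr)^{2} \,\mathrm{d}x \,\mathrm{d}y .
```

Both vanish exactly when $p(x, y) = p(x)\,p(y)$, that is, when $X$ and $Y$ are independent. MI is the KL divergence from the joint density to the product of marginals, whereas QMI replaces it with a squared $L_2$ distance between the same two densities.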
ML Aug 1, 2015
Regularized Multi-Task Learning for Multi-Dimensional Log-Density Gradient Estimation
Ikko Yamane, Hiroaki Sasaki, Masashi Sugiyama
Log-density gradient estimation is a fundamental statistical problem and possesses various practical applications such as clustering and measuring non-Gaussianity. A naive two-step approach of first estimating the density and then taking its log-gradient is unreliable because an accurate density estimate does not necessarily lead to an accurate log-density gradient estimate. To cope with this problem, a method to directly estimate the log-density gradient without density estimation has been explored, and demonstrated to work much better than the two-step method. The objective of this paper is to further improve the performance of this direct method in multi-dimensional cases. Our idea is to regard the problem of log-density gradient estimation in each dimension as a task, and apply regularized multi-task learning to the direct log-density gradient estimator. We experimentally demonstrate the usefulness of the proposed multi-task method in log-density gradient estimation and mode-seeking clustering.
ML Jun 18, 2015
Simultaneous Estimation of Non-Gaussian Components and their Correlation Structure
Hiroaki Sasaki, Michael U. Gutmann, Hayaru Shouno et al.
The statistical dependencies which independent component analysis (ICA) cannot remove often provide rich information beyond the linear independent components. It would thus be very useful to estimate the dependency structure from data. While such models have been proposed, they usually concentrated on higher-order correlations such as energy (square) correlations. Yet, linear correlations are a most fundamental and informative form of dependency in many real data sets. Linear correlations are usually completely removed by ICA and related methods, so they can only be analyzed by developing new methods which explicitly allow for linearly correlated components. In this paper, we propose a probabilistic model of linear non-Gaussian components which are allowed to have both linear and energy correlations. The precision matrix of the linear components is assumed to be randomly generated by a higher-order process and explicitly parametrized by a parameter matrix. The estimation of the parameter matrix is shown to be particularly simple because using score matching, the objective function is a quadratic form. Using simulations with artificial data, we demonstrate that the proposed method improves identifiability of non-Gaussian components by simultaneously learning their correlation structure. Applications on simulated complex cells with natural image input, as well as spectrograms of natural audio data show that the method finds new kinds of dependencies between the components.
ML Jun 30, 2014
Direct Density-Derivative Estimation and Its Application in KL-Divergence Approximation
Hiroaki Sasaki, Yung-Kyun Noh, Masashi Sugiyama
Estimation of density derivatives is a versatile tool in statistical data analysis. A naive approach is to first estimate the density and then compute its derivative. However, such a two-step approach does not work well because a good density estimator does not necessarily mean a good density-derivative estimator. In this paper, we give a direct method to approximate the density derivative without estimating the density itself. Our proposed estimator allows analytic and computationally efficient approximation of multi-dimensional high-order density derivatives, and all of its hyper-parameters can be chosen objectively by cross-validation. We further show that the proposed density-derivative estimator is useful in improving the accuracy of non-parametric KL-divergence estimation via metric learning. The practical superiority of the proposed method is experimentally demonstrated in change detection and feature selection.
ML Apr 20, 2014
Clustering via Mode Seeking by Direct Estimation of the Gradient of a Log-Density
Hiroaki Sasaki, Aapo Hyvärinen, Masashi Sugiyama
Mean shift clustering finds the modes of the data probability density by identifying the zero points of the density gradient. Since it does not require fixing the number of clusters in advance, mean shift has been a popular clustering algorithm in various application fields. A typical implementation of mean shift first estimates the density by kernel density estimation and then computes its gradient. However, since good density estimation does not necessarily imply accurate estimation of the density gradient, such an indirect two-step approach is not reliable. In this paper, we propose a method to directly estimate the gradient of the log-density without going through density estimation. The proposed method gives the global solution analytically and thus is computationally efficient. We then develop a mean-shift-like fixed-point algorithm to find the modes of the density for clustering. As in mean shift, one does not need to set the number of clusters in advance. We empirically show that the proposed clustering method works much better than mean shift, especially for high-dimensional data. Experimental results further indicate that the proposed method outperforms existing clustering methods.
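For readers unfamiliar with the baseline, the classical mean shift iteration that this abstract compares against can be sketched as follows. This is the standard Gaussian-kernel version built on kernel density estimation, not the direct log-density gradient estimator proposed in the paper. The function name `mean_shift_modes` and its parameters are hypothetical.

```python
import numpy as np


def mean_shift_modes(data, bandwidth=1.0, n_iter=100, tol=1e-6):
    """Move every sample uphill on a Gaussian kernel density estimate.

    data: array of shape (n, d).
    Returns an (n, d) array with the point each sample converged to;
    samples that end at the same point belong to the same cluster.
    """
    points = data.astype(float)  # astype returns a copy, data is untouched
    for _ in range(n_iter):
        # Squared distances between current points and all data samples.
        diff = points[:, None, :] - data[None, :, :]
        sq_dist = np.sum(diff ** 2, axis=2)
        # Gaussian kernel weights, one row per point being updated.
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        # Fixed-point update: each point moves to the weighted mean of the data.
        new_points = weights @ data / weights.sum(axis=1, keepdims=True)
        max_shift = np.max(np.linalg.norm(new_points - points, axis=1))
        points = new_points
        if max_shift < tol:
            break
    return points
```

The number of clusters is never specified: it emerges from how many distinct points the samples converge to, which depends on the choice of `bandwidth`.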