MLSep 3, 2022
Geometry of EM and related iterative algorithmsHideitsu Hino, Shotaro Akaho, Noboru Murata
The Expectation--Maximization (EM) algorithm is a simple meta-algorithm that has been used for many years as a methodology for statistical inference when there are missing measurements in the observed data or when the data is composed of observables and unobservables. Its general properties are well studied, and also, there are countless ways to apply it to individual problems. In this paper, we introduce the $em$ algorithm, an information geometric formulation of the EM algorithm, and its extensions and applications to various problems. Specifically, we will see that it is possible to formulate an outlier-robust inference algorithm, an algorithm for calculating channel capacity, parameter estimation methods on probability simplex, particular multivariate analysis methods such as principal component analysis in a space of probability models and modal regression, matrix factorization, and learning generative models, which have recently attracted attention in deep learning, from the geometric perspective.
LGDec 22, 2025
Lag Operator SSMs: A Geometric Framework for Structured State Space ModelingSutashu Tomonaga, Kenji Doya, Noboru Murata
Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system's basis functions "slide" and change from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, offering a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our approach, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.
LGJul 24, 2025
Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden WeightsJun'ichi Takeuchi, Yoshinari Takeishi, Noboru Murata et al.
Fisher information matrices and neural tangent kernels (NTK) for 2-layer ReLU networks with random hidden weight are argued. We discuss the relation between both notions as a linear transformation and show that spectral decomposition of NTK with concrete forms of eigenfunctions with major eigenvalues. We also obtain an approximation formula of the functions presented by the 2-layer neural networks.
CVJan 7, 2020
Fast and robust multiplane single molecule localization microscopy using deep neural networkToshimitsu Aritake, Hideitsu Hino, Shigeyuki Namiki et al.
Single molecule localization microscopy is widely used in biological research for measuring the nanostructures of samples smaller than the diffraction limit. This study uses multifocal plane microscopy and addresses the 3D single molecule localization problem, where lateral and axial locations of molecules are estimated. However, when we multifocal plane microscopy is used, the estimation accuracy of 3D localization is easily deteriorated by the small lateral drifts of camera positions. We formulate a 3D molecule localization problem along with the estimation of the lateral drifts as a compressed sensing problem, A deep neural network was applied to accurately and efficiently solve this problem. The proposed method is robust to the lateral drifts and achieves an accuracy of 20 nm laterally and 50 nm axially without an explicit drift correction.
LGSep 27, 2019
On a convergence property of a geometrical algorithm for statistical manifoldsShotaro Akaho, Hideitsu Hino, Noboru Murata
In this paper, we examine a geometrical projection algorithm for statistical inference. The algorithm is based on Pythagorean relation and it is derivative-free as well as representation-free that is useful in nonparametric cases. We derive a bound of learning rate to guarantee local convergence. In special cases of m-mixture and e-mixture estimation problems, we calculate specific forms of the bound that can be used easily in practice.
MLMay 19, 2018
The global optimum of shallow neural network is attained by ridgelet transformSho Sonoda, Isao Ishikawa, Masahiro Ikeda et al.
We prove that the global minimum of the backpropagation (BP) training problem of neural networks with an arbitrary nonlinear activation is given by the ridgelet transform. A series of computational experiments show that there exists an interesting similarity between the scatter plot of hidden parameters in a shallow neural network after the BP training and the spectrum of the ridgelet transform. By introducing a continuous model of neural networks, we reduce the training problem to a convex optimization in an infinite dimensional Hilbert space, and obtain the explicit expression of the global optimizer via the ridgelet transform.
LGDec 12, 2017
Transportation analysis of denoising autoencoders: a novel method for analyzing deep neural networksSho Sonoda, Noboru Murata
The feature map obtained from the denoising autoencoder (DAE) is investigated by determining transportation dynamics of the DAE, which is a cornerstone for deep learning. Despite the rapid development in its application, deep neural networks remain analytically unexplained, because the feature maps are nested and parameters are not faithful. In this paper, we address the problem of the formulation of nested complex of parameters by regarding the feature map as a transport map. Even when a feature map has different dimensions between input and output, we can regard it as a transportation map by considering that both the input and output spaces are embedded in a common high-dimensional space. In addition, the trajectory is a geometric object and thus, is independent of parameterization. In this manner, transportation can be regarded as a universal character of deep neural networks. By determining and analyzing the transportation dynamics, we can understand the behavior of a deep neural network. In this paper, we investigate a fundamental case of deep neural networks: the DAE. We derive the transport map of the DAE, and reveal that the infinitely deep DAE transports mass to decrease a certain quantity, such as entropy, of the data distribution. These results though analytically simple, shed light on the correspondence between deep neural networks and the Wasserstein gradient flows.
LGMay 10, 2016
Transport Analysis of Infinitely Deep Neural NetworkSho Sonoda, Noboru Murata
We investigated the feature map inside deep neural networks (DNNs) by tracking the transport map. We are interested in the role of depth (why do DNNs perform better than shallow models?) and the interpretation of DNNs (what do intermediate layers do?) Despite the rapid development in their application, DNNs remain analytically unexplained because the hidden layers are nested and the parameters are not faithful. Inspired by the integral representation of shallow NNs, which is the continuum limit of the width, or the hidden unit number, we developed the flow representation and transport analysis of DNNs. The flow representation is the continuum limit of the depth or the hidden layer number, and it is specified by an ordinary differential equation with a vector field. We interpret an ordinary DNN as a transport map or a Euler broken line approximation of the flow. Technically speaking, a dynamical system is a natural model for the nested feature maps. In addition, it opens a new way to the coordinate-free treatment of DNNs by avoiding the redundant parametrization of DNNs. Following Wasserstein geometry, we analyze a flow in three aspects: dynamical system, continuity equation, and Wasserstein gradient flow. A key finding is that we specified a series of transport maps of the denoising autoencoder (DAE). Starting from the shallow DAE, this paper develops three topics: the transport map of the deep DAE, the equivalence between the stacked DAE and the composition of DAEs, and the development of the double continuum limit or the integral representation of the flow representation. As partial answers to the research questions, we found that deeper DAEs converge faster and the extracted features are better; in addition, a deep Gaussian DAE transports mass to decrease the Shannon entropy of the data distribution.
CVDec 2, 2015
Double Sparse Multi-Frame Image Super ResolutionToshiyuki Kato, Hideitsu Hino, Noboru Murata
A large number of image super resolution algorithms based on the sparse coding are proposed, and some algorithms realize the multi-frame super resolution. In multi-frame super resolution based on the sparse coding, both accurate image registration and sparse coding are required. Previous study on multi-frame super resolution based on sparse coding firstly apply block matching for image registration, followed by sparse coding to enhance the image resolution. In this paper, these two problems are solved by optimizing a single objective function. The results of numerical experiments support the effectiveness of the proposed approch.
NEMay 14, 2015
Neural Network with Unbounded Activation Functions is Universal ApproximatorSho Sonoda, Noboru Murata
This paper presents an investigation of the approximation property of neural networks with unbounded activation functions, such as the rectified linear unit (ReLU), which is the new de-facto standard of deep learning. The ReLU network can be analyzed by the ridgelet transform with respect to Lizorkin distributions. By showing three reconstruction formulas by using the Fourier slice theorem, the Radon transform, and Parseval's relation, it is shown that a neural network with unbounded activation functions still satisfies the universal approximation property. As an additional consequence, the ridgelet transform, or the backprojection filter in the Radon domain, is what the network learns after backpropagation. Subject to a constructive admissibility condition, the trained network can be obtained by simply discretizing the ridgelet transform, without backpropagation. Numerical examples not only support the consistency of the admissibility condition but also imply that some non-admissible cases result in low-pass filtering.
CVFeb 17, 2014
Sparse Coding Approach for Multi-Frame Image Super ResolutionToshiyuki Kato, Hideitsu Hino, Noboru Murata
An image super-resolution method from multiple observation of low-resolution images is proposed. The method is based on sub-pixel accuracy block matching for estimating relative displacements of observed images, and sparse signal representation for estimating the corresponding high-resolution image. Relative displacements of small patches of observed low-resolution images are accurately estimated by a computationally efficient block matching method. Since the estimated displacements are also regarded as a warping component of image degradation process, the matching results are directly utilized to generate low-resolution dictionary for sparse image representation. The matching scores of the block matching are used to select a subset of low-resolution patches for reconstructing a high-resolution patch, that is, an adaptive selection of informative low-resolution images is realized. When there is only one low-resolution image, the proposed method works as a single-frame super-resolution method. The proposed method is shown to perform comparable or superior to conventional single- and multi-frame super-resolution methods through experiments using various real-world datasets.
LGDec 23, 2013
Nonparametric Weight Initialization of Neural Networks via Integral RepresentationSho Sonoda, Noboru Murata
A new initialization method for hidden parameters in a neural network is proposed. Derived from the integral representation of the neural network, a nonparametric probability distribution of hidden parameters is introduced. In this proposal, hidden parameters are initialized by samples drawn from this distribution, and output parameters are fitted by ordinary linear regression. Numerical experiments show that backpropagation with proposed initialization converges faster than uniformly random initialization. Also it is shown that the proposed method achieves enough accuracy by itself without backpropagation in some cases.