Akifumi Okuno

ML
h-index: 6
17 papers
43 citations
Novelty: 52%
AI Score: 35

17 Papers

ME | Apr 18, 2022
A Greedy and Optimistic Approach to Clustering with a Specified Uncertainty of Covariates

Akifumi Okuno, Kohei Hattori

In this study, we examine a clustering problem in which the covariates of each individual element in a dataset are associated with an uncertainty specific to that element. More specifically, we consider a clustering approach in which a pre-processing step that applies a non-linear transformation to the covariates is used to capture the hidden data structure. To this end, we approximate the sets representing the propagated uncertainty for the pre-processed features empirically. To exploit the empirical uncertainty sets, we propose a greedy and optimistic clustering (GOC) algorithm that finds better feature candidates over such sets, yielding more condensed clusters. As an important application, we apply the GOC algorithm to synthetic datasets of the orbital properties of stars generated through our numerical simulation mimicking the formation process of the Milky Way. The GOC algorithm demonstrates an improved performance in finding sibling stars originating from the same dwarf galaxy. These realistic datasets have also been made publicly available.

ME | Jun 28, 2023
Autoregressive with Slack Time Series Model for Forecasting a Partially-Observed Dynamical Time Series

Akifumi Okuno, Yuya Morishita, Yoh-ichi Mototake

This study delves into the domain of dynamical systems, specifically the forecasting of dynamical time series defined through an evolution function. Traditional approaches in this area predict the future behavior of dynamical systems by inferring the evolution function. However, these methods may confront obstacles due to the presence of missing variables, which are usually attributed to challenges in measurement and a partial understanding of the system of interest. To overcome this obstacle, we introduce the autoregressive with slack time series (ARS) model, which simultaneously estimates the evolution function and imputes missing variables as a slack time series. Assuming time-invariance and linearity in the (underlying) entire dynamical time series, our experiments demonstrate the ARS model's capability to forecast future time series. From a theoretical perspective, we prove that a 2-dimensional time-invariant and linear system can be reconstructed by utilizing observations from a single, partially observed dimension of the system.

ME | Mar 31, 2023
An interpretable neural network-based non-proportional odds model for ordinal regression

Akifumi Okuno, Kazuharu Harada

This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is defined for both continuous and discrete responses, whereas standard methods typically treat the ordered continuous variables as if they are discrete, (2) instead of estimating response-dependent finite-dimensional coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets.

ME | Jul 15, 2024
An integrated perspective of robustness in regression through the lens of the bias-variance trade-off

Akifumi Okuno

This paper presents an integrated perspective on robustness in regression. Specifically, we examine the relationship between traditional outlier-resistant robust estimation and robust optimization, which focuses on parameter estimation resistant to imaginary dataset-perturbations. While both are commonly regarded as robust methods, these concepts demonstrate a bias-variance trade-off, indicating that they follow roughly converse strategies.

ME | Aug 4, 2023
Outlier-robust neural network training: variation regularization meets trimmed loss to prevent functional breakdown

Akifumi Okuno, Shotaro Yagishita

In this study, we tackle the challenge of outlier-robust predictive modeling using highly expressive neural networks. Our approach integrates two key components: (1) a transformed trimmed loss (TTL), a computationally efficient variant of the classical trimmed loss, and (2) higher-order variation regularization (HOVR), which imposes smoothness constraints on the prediction function. While traditional robust statistics typically assume low-complexity models such as linear and kernel models, applying TTL alone to modern neural networks may fail to ensure robustness, as their high expressive power allows them to fit both inliers and outliers, even when a robust loss is used. To address this, we revisit the traditional notion of breakdown point and adapt it to the nonlinear function setting, introducing a regularization scheme via HOVR that controls the model's capacity and suppresses overfitting to outliers. We theoretically establish that our training procedure retains a high functional breakdown point, thereby ensuring robustness to outlier contamination. We develop a stochastic optimization algorithm tailored to this framework and provide a theoretical guarantee of its convergence.
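As a point of reference for the abstract above, the classical trimmed loss keeps only the h smallest per-sample losses and discards the rest, so that a few gross outliers cannot dominate the objective. The sketch below illustrates that classical loss with squared residuals; it is not the transformed trimmed loss (TTL) or the HOVR regularizer from the paper, and the function name and arguments are illustrative assumptions.

```python
import numpy as np

def trimmed_squared_loss(residuals, h):
    """Classical trimmed loss: sum of the h smallest squared residuals.

    This is the textbook trimmed loss, not the paper's TTL variant.
    `residuals` and `h` are illustrative names chosen for this sketch.
    """
    losses = np.sort(np.square(np.asarray(residuals, dtype=float)))
    return float(losses[:h].sum())

# With h = 2, the residual of 10 is treated as an outlier and ignored.
loss = trimmed_squared_loss([1.0, -2.0, 10.0], h=2)
```

Setting h equal to the sample size recovers the ordinary sum of squared errors.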

ML | Aug 25, 2025
Algebraic Approach to Ridge-Regularized Mean Squared Error Minimization in Minimal ReLU Neural Network

Ryoya Fukasaku, Yutaro Kabata, Akifumi Okuno

This paper investigates a perceptron, a simple neural network model, with ReLU activation and a ridge-regularized mean squared error (RR-MSE). Our approach leverages the fact that the RR-MSE for a ReLU perceptron is piecewise polynomial, enabling a systematic analysis using tools from computational algebra. In particular, we develop a Divide-Enumerate-Merge strategy that exhaustively enumerates all local minima of the RR-MSE. By virtue of the algebraic formulation, our approach can identify not only the typical zero-dimensional minima (i.e., isolated points) obtained by numerical optimization, but also higher-dimensional minima (i.e., connected sets such as curves, surfaces, or hypersurfaces). Although computational algebraic methods are computationally very intensive for perceptrons of practical size, as a proof of concept, we apply the proposed approach in practice to minimal perceptrons with a few hidden units.

ML | Dec 28, 2021
Improving Nonparametric Classification via Local Radial Regression with an Application to Stock Prediction

Ruixing Cao, Akifumi Okuno, Kei Nakagawa et al.

For supervised classification problems, this paper considers estimating the query's label probability through local regression using observed covariates. The well-known nonparametric kernel smoother and $k$-nearest neighbor ($k$-NN) estimator, which take the label average over a ball around the query, are consistent but asymptotically biased, particularly for a large radius of the ball. To eradicate such bias, local polynomial regression (LPoR) and multiscale $k$-NN (MS-$k$-NN) learn the bias term by local regression around the query and extrapolate it to the query itself. However, their theoretical optimality has been shown only in the limit of an infinite number of training samples. For correcting the asymptotic bias with fewer observations, this paper proposes local radial regression (LRR) and its logistic regression variant, called local radial logistic regression (LRLR), by combining the advantages of LPoR and MS-$k$-NN. The idea is quite simple: we fit the local regression to observed labels by taking only the radial distance as the explanatory variable and then extrapolate the estimated label probability to zero distance. The usefulness of the proposed method is shown theoretically and experimentally. We prove the convergence rate of the $L^2$ risk for LRR with reference to MS-$k$-NN, and our numerical experiments, including real-world datasets of daily stock indices, demonstrate that LRLR outperforms LPoR and MS-$k$-NN.

ML | Dec 7, 2021
A generalization gap estimation for overparameterized models via the Langevin functional variance

Akifumi Okuno, Keisuke Yano

This paper discusses the estimation of the generalization gap, the difference between generalization performance and training performance, for overparameterized models including neural networks. We first show that a functional variance, a key concept in defining a widely-applicable information criterion, characterizes the generalization gap even in overparameterized settings where a conventional theory cannot be applied. As computing the functional variance is expensive for overparameterized models, we propose an efficient approximation of the functional variance, the Langevin approximation of the functional variance (Langevin FV). This method leverages only the $1$st-order gradient of the squared loss function, without referencing the $2$nd-order gradient; this ensures that the computation is efficient and the implementation is consistent with gradient-based optimization algorithms. We demonstrate the Langevin FV numerically by estimating the generalization gaps of overparameterized linear regression and non-linear neural network models containing more than a thousand parameters.

ST | Dec 1, 2021
Minimax Analysis for Inverse Risk in Nonparametric Planer Invertible Regression

Akifumi Okuno, Masaaki Imaizumi

We study a minimax risk of estimating inverse functions on a plane, while requiring the estimator itself to be invertible. Learning invertibility from data and exploiting an invertible estimator are used in many domains, such as statistics, econometrics, and machine learning. Although the consistency and universality of invertible estimators have been well investigated, analysis of the efficiency of these methods is still under development. In this study, we analyze a minimax risk for estimating invertible bi-Lipschitz functions on a square in a $2$-dimensional plane. We first introduce two types of $L^2$-risks to evaluate an estimator which preserves invertibility. Then, we derive lower and upper rates for minimax values for the risks associated with inverse functions. For the derivation, we exploit a representation of invertible functions using level-sets. Specifically, to obtain the upper rate, we develop an estimator asymptotically almost everywhere invertible, whose risk attains the derived minimax lower rate up to logarithmic factors. The derived minimax rate corresponds to that of the non-invertible bi-Lipschitz function, which shows that the invertibility does not reduce the complexity of the estimation problem in terms of the rate.

LG | May 2, 2020
Stochastic Neighbor Embedding of Multimodal Relational Data for Image-Text Simultaneous Visualization

Morihiro Mizutani, Akifumi Okuno, Geewook Kim et al.

Multimodal relational data analysis has become increasingly important in recent years for exploring across different domains of data, such as images and their text tags obtained from social networking services (e.g., Flickr). A variety of data analysis methods have been developed for visualization; to give an example, t-Stochastic Neighbor Embedding (t-SNE) computes low-dimensional feature vectors so that their similarities keep those of the observed data vectors. However, t-SNE is designed only for a single domain of data and not for multimodal data; this paper aims at visualizing multimodal relational data consisting of data vectors in multiple domains with relations across these vectors. By extending t-SNE, we herein propose Multimodal Relational Stochastic Neighbor Embedding (MR-SNE), which (1) first computes augmented relations, where we observe the relations across domains and compute those within each domain via the observed data vectors, and (2) jointly embeds the augmented relations to a low-dimensional space. Through visualization of Flickr and Animal with Attributes 2 datasets, the proposed MR-SNE is compared with other graph embedding-based approaches; MR-SNE demonstrates promising performance.

ML | Feb 8, 2020
Extrapolation Towards Imaginary $0$-Nearest Neighbour and Its Improved Convergence Rate

Akifumi Okuno, Hidetoshi Shimodaira

$k$-nearest neighbour ($k$-NN) is one of the simplest and most widely-used methods for supervised classification, which predicts a query's label by taking a weighted ratio of observed labels of the $k$ objects nearest to the query. The weights and the parameter $k \in \mathbb{N}$ regulate its bias-variance trade-off, and the trade-off implicitly affects the convergence rate of the excess risk for the $k$-NN classifier; several existing studies considered selecting optimal $k$ and weights to obtain a faster convergence rate. Whereas $k$-NN with non-negative weights has been developed widely, it was also proved that negative weights are essential for eradicating the bias terms and attaining the optimal convergence rate. In this paper, we propose a novel multiscale $k$-NN (MS-$k$-NN), which extrapolates unweighted $k$-NN estimators from several $k \ge 1$ values to $k=0$, thus giving an imaginary 0-NN estimator. Our method implicitly computes optimal real-valued weights that are adaptive to the query and its neighbour points. We theoretically prove that the MS-$k$-NN attains the improved rate, which coincides with the existing optimal rate under some conditions.
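The extrapolation idea in this abstract can be illustrated with a rough sketch: compute unweighted $k$-NN label averages for several $k$, regress them on a function of the neighbourhood radius, and read off the fitted value at radius zero. The regression model used here (a line in the squared radius), the choice of $k$ values, and all function names are assumptions made for illustration, not the paper's exact estimator.

```python
import numpy as np

def knn_estimate(X, y, query, k):
    """Unweighted k-NN label average and the distance to the k-th neighbour."""
    dists = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean(), dists[nearest[-1]]

def multiscale_knn(X, y, query, ks=(5, 10, 20, 40)):
    """Extrapolate k-NN estimates to radius zero (the 'imaginary 0-NN').

    Assumed model for this sketch: estimate ~ b0 + b1 * r^2, where r is the
    k-th neighbour distance; the intercept b0 is the extrapolated value.
    """
    estimates, radii = zip(*(knn_estimate(X, y, query, k) for k in ks))
    design = np.column_stack([np.ones(len(ks)), np.square(radii)])
    coef, *_ = np.linalg.lstsq(design, np.asarray(estimates), rcond=None)
    return float(coef[0])
```

Because the intercept is obtained by extrapolation, it can fall outside $[0, 1]$; whether and how to clip it is a design choice left open in this sketch.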

SI | Jul 22, 2019
Hyperlink Regression via Bregman Divergence

Akifumi Okuno, Hidetoshi Shimodaira

A collection of $U \: (\in \mathbb{N})$ data vectors is called a $U$-tuple, and the association strength among the vectors of a tuple is termed the hyperlink weight, which is assumed to be symmetric with respect to permutation of the entries in the index. We herein propose Bregman hyperlink regression (BHLR), which learns a user-specified symmetric similarity function such that it predicts the tuple's hyperlink weight from data vectors stored in the $U$-tuple. BHLR is a simple and general framework for hyper-relational learning, which minimizes Bregman-divergence (BD) between the hyperlink weights and estimated similarities defined for the corresponding tuples; BHLR encompasses various existing methods, such as logistic regression ($U=1$), Poisson regression ($U=1$), link prediction ($U=2$), and those for representation learning, such as graph embedding ($U=2$), matrix factorization ($U=2$), tensor factorization ($U \geq 2$), and their variants equipped with arbitrary BD. Nonlinear functions (e.g., neural networks) can be employed for the similarity functions. However, there are theoretical challenges, in that different tuples of BHLR may share data vectors, unlike the i.i.d. setting of classical regression. We address these theoretical issues and prove that BHLR equipped with arbitrary BD and $U \in \mathbb{N}$ is (P-1) statistically consistent, that is, it asymptotically recovers the underlying true conditional expectation of hyperlink weights given data vectors, and (P-2) computationally tractable, that is, it is efficiently computed by stochastic optimization algorithms using a novel generalized minibatch sampling procedure for hyper-relational data. Consequently, theoretical guarantees for BHLR, including several existing methods that have been examined experimentally, are provided in a unified manner.

LG | Feb 27, 2019
Representation Learning with Weighted Inner Product for Universal Approximation of General Similarities

Geewook Kim, Akifumi Okuno, Kazuki Fukui et al.

We propose weighted inner product similarity (WIPS) for neural network-based graph embedding. In addition to the parameters of neural networks, we optimize the weights of the inner product by allowing positive and negative values. Despite its simplicity, WIPS can approximate arbitrary general similarities including positive definite, conditionally positive definite, and indefinite kernels. WIPS is free from similarity model selection, since it can learn any similarity models such as cosine similarity, negative Poincaré distance and negative Wasserstein distance. Our experiments show that the proposed method can learn high-quality distributed representations of nodes from real datasets, leading to an accurate approximation of similarities as well as high performance in inductive tasks.
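A minimal sketch of the similarity described above, assuming it takes the form of a coordinate-wise weighted inner product of two feature vectors with weights that may be positive or negative. In the paper the feature vectors come from neural networks and the weights are learned; here both are fixed arrays, and the function and argument names are illustrative.

```python
import numpy as np

def weighted_inner_product(fx, fy, weights):
    """Weighted inner product: sum_k weights[k] * fx[k] * fy[k].

    fx, fy stand in for neural-network feature vectors; weights may take
    either sign. All names are illustrative assumptions for this sketch.
    """
    fx, fy, weights = (np.asarray(a, dtype=float) for a in (fx, fy, weights))
    return float(np.sum(weights * fx * fy))

# All-ones weights recover the ordinary inner product.
ips = weighted_inner_product([1.0, 2.0], [3.0, 4.0], [1.0, 1.0])
# A negative weight gives an indefinite (Minkowski-like) bilinear form.
indefinite = weighted_inner_product([1.0, 2.0], [3.0, 4.0], [1.0, -1.0])
```

Allowing negative weights is what lets the similarity go beyond positive definite kernels, since a bilinear form with mixed-sign weights is no longer positive semidefinite.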

ML | Feb 22, 2019
Robust Graph Embedding with Noisy Link Weights

Akifumi Okuno, Hidetoshi Shimodaira

We propose $β$-graph embedding for robustly learning feature vectors from data vectors and noisy link weights. A newly introduced empirical moment $β$-score reduces the influence of contamination and robustly measures the difference between the underlying correct expected weights of links and the specified generative model. The proposed method is computationally tractable; we employ a minibatch-based efficient stochastic algorithm and prove that this algorithm locally minimizes the empirical moment $β$-score. We conduct numerical experiments on synthetic and real-world datasets.

ML | Oct 4, 2018
Graph Embedding with Shifted Inner Product Similarity and Its Improved Approximation Capability

Akifumi Okuno, Geewook Kim, Hidetoshi Shimodaira

We propose shifted inner-product similarity (SIPS), which is a novel yet very simple extension of the ordinary inner-product similarity (IPS) for neural-network based graph embedding (GE). In contrast to IPS, which is limited to approximating positive-definite (PD) similarities, SIPS goes beyond the limitation by introducing bias terms in IPS; we theoretically prove that SIPS is capable of approximating not only PD but also conditionally PD (CPD) similarities with many examples such as cosine similarity, negative Poincaré distance and negative Wasserstein distance. Since SIPS with sufficiently large neural networks learns a variety of similarities, SIPS alleviates the need for configuring the similarity function of GE. The approximation error rate is also evaluated, and experiments on two real-world datasets demonstrate that graph embedding using SIPS indeed outperforms existing methods.

ML | May 31, 2018
On representation power of neural network-based graph embedding and beyond

Akifumi Okuno, Hidetoshi Shimodaira

We consider the representation power of siamese-style similarity functions used in neural network-based graph embedding. The inner product similarity (IPS) with feature vectors computed via neural networks is commonly used for representing the strength of association between two nodes. However, little work has been done on the representation capability of IPS. A very recent work sheds light on the nature of IPS and reveals that IPS has the capability of approximating any positive definite (PD) similarities. However, a simple example demonstrates the fundamental limitation of IPS in approximating non-PD similarities. We then propose a novel model named Shifted IPS (SIPS) that approximates any Conditionally PD (CPD) similarities arbitrarily well. CPD is a generalization of PD with many examples such as negative Poincaré distance and negative Wasserstein distance; thus SIPS has the potential to significantly improve the applicability of graph embedding without taking great care in configuring the similarity function. Our numerical experiments demonstrate the superiority of SIPS over IPS. In theory, we further extend SIPS beyond CPD by considering the inner product in Minkowski space so that it approximates more general similarities.

ML | Feb 13, 2018
A probabilistic framework for multi-view feature learning with many-to-many associations via neural networks

Akifumi Okuno, Tetsuya Hada, Hidetoshi Shimodaira

A simple framework, Probabilistic Multi-view Graph Embedding (PMvGE), is proposed for multi-view feature learning with many-to-many associations so that it generalizes various existing multi-view methods. PMvGE is a probabilistic model for predicting new associations via graph embedding of the nodes of data vectors with links of their associations. Multi-view data vectors with many-to-many associations are transformed by neural networks to feature vectors in a shared space, and the probability of a new association between two data vectors is modeled by the inner product of their feature vectors. While existing multi-view feature learning techniques can treat only one of many-to-many associations or non-linear transformations, PMvGE can treat both simultaneously. By combining Mercer's theorem and the universal approximation theorem, we prove that PMvGE learns a wide class of similarity measures across views. Our likelihood-based estimator enables efficient computation of non-linear transformations of data vectors in large-scale datasets by minibatch SGD, and numerical experiments illustrate that PMvGE outperforms existing multi-view methods.