Masahiro Nomura

LG
h-index: 16
Papers: 20
Citations: 167
Novelty: 56%
AI Score: 57

20 Papers

LG | Feb 3, 2023
Towards Practical Preferential Bayesian Optimization with Skew Gaussian Processes

Shion Takeno, Masahiro Nomura, Masayuki Karasuyama

We study preferential Bayesian optimization (BO), where reliable feedback is limited to pairwise comparisons called duels. An important challenge in preferential BO, which uses the preferential Gaussian process (GP) model to represent flexible preference structure, is that the posterior distribution is a computationally intractable skew GP. The most widely used approach for preferential BO is Gaussian approximation, which ignores the skewness of the true posterior. Alternatively, Markov chain Monte Carlo (MCMC) based preferential BO has also been proposed. In this work, we first verify the accuracy of Gaussian approximation, from which we reveal the critical problem that the predictive probability of duels can be inaccurate. This observation motivates us to improve the MCMC-based estimation for skew GP, for which we show the practical efficiency of Gibbs sampling and derive a low-variance MC estimator. However, the computational time of MCMC can still be a bottleneck in practice. Towards building a more practical preferential BO, we develop a new method that achieves both high computational efficiency and low sample complexity, and then demonstrate its effectiveness through extensive numerical experiments.

ML | Aug 20, 2024
Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits

Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto et al.

We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset in the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L of CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem. However, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce the concept of a factored action space, which allows us to decompose each subset into binary indicators. This formulation allows us to distinguish between the "main effect", derived from the main actions, and the "residual effect", originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as illustrated in our theoretical analysis. Experiments demonstrate OPCB's superior performance over typical methods in both OPE and OPL.

42.1 | NE | May 18
Adaptive Stochastic Natural Gradient Method for Safe Optimization on Binary Space

Kento Uchida, Ryoki Hamano, Masahiro Nomura et al.

Optimization problems in real-world applications across the medical and engineering domains often involve potential risks when evaluating candidate solutions. Safe optimization aims to perform optimization while suppressing unsafe solution evaluations in such situations. For continuous search spaces, there exist safe optimization methods based on evolutionary computation. However, the algorithm development of safe optimization methods for binary search spaces has not been adequately addressed. In this study, we incorporate additional mechanisms for safe optimization into a binary optimization method, the adaptive stochastic natural gradient method (ASNG) with a family of Bernoulli distributions. For safety functions that must be kept non-negative during optimization, the proposed method, safe ASNG, estimates the Lipschitz constants with respect to the Hamming distance by constructing surrogate models of safety functions based on discrete Walsh functions. Then, safe ASNG computes a safe region that consists of safe solutions around the previously evaluated safe solutions. By projecting newly generated solutions to their nearest neighbors within the safe region, safe ASNG suppresses unsafe solution evaluations. Experimental results on benchmark problems on binary domains confirm that, while the comparative methods fail to suppress unsafe solution evaluations, safe ASNG achieves efficient optimization while effectively suppressing unsafe solution evaluations.
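The Lipschitz argument in this abstract can be illustrated with a small sketch. It assumes the usual reasoning that a safety function h with Lipschitz constant L under the Hamming distance satisfies h(y) >= h(x) - L * d(x, y), so y is certified safe whenever that lower bound is non-negative for some evaluated safe point x. The function names and the toy data layout are hypothetical, and the sketch omits the Walsh-function surrogate and the projection step; it is not the authors' implementation.

```python
from typing import Sequence, Tuple

Bits = Tuple[int, ...]


def hamming(a: Bits, b: Bits) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_certified_safe(
    candidate: Bits,
    evaluated_safe: Sequence[Tuple[Bits, float]],
    lipschitz: float,
) -> bool:
    """True if some evaluated safe point guarantees the candidate is safe.

    evaluated_safe holds (solution, safety value) pairs with value >= 0.
    The Lipschitz lower bound on the candidate's safety value is
    value - lipschitz * hamming(solution, candidate).
    """
    return any(
        value - lipschitz * hamming(solution, candidate) >= 0.0
        for solution, value in evaluated_safe
    )
```

A larger assumed Lipschitz constant shrinks the certified region, which is why the quality of the constant's estimate matters for how much of the search space remains reachable.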

5.5 | LG | Apr 19
On the Generalization Bounds of Symbolic Regression with Genetic Programming

Masahiro Nomura, Ryoki Hamano, Isao Ono

Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.

21.7 | NE | May 15
Diversified Residual Symbolic Regression

Koki Ikeda, Masahiro Nomura, Ryoki Hamano

Symbolic regression (SR) aims to discover explicit mathematical expressions that explain observed data and is widely used in domains where interpretability is essential. Because interpretability requires expressions to reflect meaningful regularities, SR is sensitive to observations that deviate from the dominant relationship. Such irregular observations, or outliers, are common in real-world data and can hinder SR from identifying underlying regularities. Robust regression mitigates this by downweighting observations with large residuals. However, deciding which observations should be treated as outliers is often ambiguous and depends on user interpretation and domain knowledge, a perspective largely overlooked in existing SR studies. This motivates approaches that present multiple candidate expressions, allowing users to examine different residual patterns and choose expressions consistent with their expertise. We propose diversified residual symbolic regression (DRSR), which achieves high predictive accuracy while promoting diversity with respect to residual patterns based on the Quality-Diversity paradigm. DRSR collects multiple expressions that fit the data well but differ in how residuals are distributed, enabling post-search selection aligned with domain knowledge. On a synthetic mixture dataset, DRSR produces more diverse expressions than conventional SR while capturing multiple underlying relationships. On a real-world astronomical dataset, DRSR discovers multiple expressions consistent with known physical relationships.

LG | Feb 17
Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

Ren Kishimoto, Rikiya Takehi, Koichi Tanaka et al.

On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Some may use fairness objectives to solve the problem of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user's profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.

LG | Feb 26
Takeuchi's Information Criteria as Generalization Measures for DNNs Close to NTK Regime

Hiroki Naganuma, Taiji Suzuki, Rio Yokota et al.

Generalization measures have been studied extensively in the machine learning community to better characterize generalization gaps. However, establishing a reliable generalization measure for statistically singular models such as deep neural networks (DNNs) is difficult due to their complex nature. This study focuses on Takeuchi's information criterion (TIC) to investigate the conditions under which this classical measure can effectively explain the generalization gaps of DNNs. Importantly, the developed theory indicates the applicability of TIC near the neural tangent kernel (NTK) regime. In a series of experiments, we trained more than 5,000 DNN models with 12 architectures, including large models (e.g., VGG-16), on four datasets, and estimated the corresponding TIC values to examine the relationship between the generalization gap and the TIC estimates. We applied several TIC approximation methods with feasible computational costs and assessed the accuracy trade-off. Our experimental results indicate that the estimated TIC values correlate well with the generalization gap under conditions close to the NTK regime. However, we show both theoretically and empirically that outside the NTK regime such correlation disappears. Finally, we demonstrate that TIC provides better trial pruning ability than existing methods for hyperparameter optimization.

CV | Mar 9, 2025
Instance-wise Supervision-level Optimization in Active Learning

Shinnosuke Matsuo, Riku Togashi, Ryoma Bise et al.

Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost. This code is available at https://github.com/matsuo-shinnosuke/ISOAL.

ML | Feb 3, 2024
Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction

Haruka Kiyohara, Masahiro Nomura, Yuta Saito

We study off-policy evaluation (OPE) in the problem of slate contextual bandits, where a policy selects multi-dimensional actions known as slates. This problem is widespread in applications ranging from recommender systems, search engines, and marketing to medicine. However, the typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces, making effective OPE a significant challenge. The PseudoInverse (PI) estimator has been introduced to mitigate the variance issue by assuming linearity in the reward function, but this can result in significant bias, as this assumption is hard to verify from observed data and is often substantially violated. To address the limitations of previous estimators, we develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space where we optimize slate abstractions to minimize the bias and variance of LIPS in a data-driven way. By doing so, LIPS can substantially reduce the variance of IPS without imposing restrictive assumptions on the reward function structure, such as linearity. Through empirical evaluation, we demonstrate that LIPS substantially outperforms existing estimators, particularly in scenarios with non-linear rewards and large slate spaces.
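For orientation, the IPS baseline that this abstract contrasts against can be sketched as follows. This is a minimal illustration of standard inverse propensity scoring, not the LIPS estimator itself; the function name `ips_estimate` and the flat list format of the logged data are assumptions made for the example.

```python
from typing import Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
) -> float:
    """Standard IPS estimate of a target policy's value from logged data.

    Index i describes one logged interaction: the observed reward, the
    probability the logging policy gave to the logged slate, and the
    probability the target policy would give to that same slate.
    """
    if not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("at least one logged interaction is required")
    total = 0.0
    for reward, p_log, p_target in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        total += (p_target / p_log) * reward  # importance weight times reward
    return total / len(rewards)
```

When the slate space is large, the logging probability of any single slate tends to be tiny, so the importance weights in this sketch can become very large. That is the variance problem the abstract refers to.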

LG | Apr 23, 2024
Hyperparameter Optimization Can Even be Harmful in Off-Policy Learning and How to Deal with It

Yuta Saito, Masahiro Nomura

There has been growing interest in off-policy evaluation in areas such as recommender systems and personalized medicine. We have so far seen significant progress in developing estimators aimed at accurately estimating the effectiveness of counterfactual policies based on biased logged data. However, there are many cases where those estimators are used not only to evaluate the value of decision-making policies but also to search for the best hyperparameters from a large candidate space. This work explores the latter hyperparameter optimization (HPO) task for off-policy learning. We empirically show that naively applying an unbiased estimator of the generalization performance as a surrogate objective in HPO can cause an unexpected failure, merely pursuing hyperparameters whose generalization performance is greatly overestimated. We then propose simple and computationally efficient corrections to the typical HPO procedure to deal with the aforementioned issues simultaneously. Empirical investigations demonstrate the effectiveness of our proposed HPO algorithm in situations where the typical procedure fails severely.

NE | Apr 30, 2025
A Memetic Algorithm based on Variational Autoencoder for Black-Box Discrete Optimization with Epistasis among Parameters

Aoi Kato, Kenta Kojima, Masahiro Nomura et al.

Black-box discrete optimization (BB-DO) problems arise in many real-world applications, such as neural architecture search and mathematical model estimation. A key challenge in BB-DO is epistasis among parameters where multiple variables must be modified simultaneously to effectively improve the objective function. Estimation of Distribution Algorithms (EDAs) provide a powerful framework for tackling BB-DO problems. In particular, an EDA leveraging a Variational Autoencoder (VAE) has demonstrated strong performance on relatively low-dimensional problems with epistasis while reducing computational cost. Meanwhile, evolutionary algorithms such as DSMGA-II and P3, which integrate bit-flip-based local search with linkage learning, have shown excellent performance on high-dimensional problems. In this study, we propose a new memetic algorithm that combines VAE-based sampling with local search. The proposed method inherits the strengths of both VAE-based EDAs and local search-based approaches: it effectively handles high-dimensional problems with epistasis among parameters without incurring excessive computational overhead. Experiments on NK landscapes -- a challenging benchmark for BB-DO involving epistasis among parameters -- demonstrate that our method outperforms state-of-the-art VAE-based EDA methods, as well as leading approaches such as P3 and DSMGA-II.

NE | Jan 27, 2022
Fast Moving Natural Evolution Strategy for High-Dimensional Problems

Masahiro Nomura, Isao Ono

In this work, we propose a new variant of natural evolution strategies (NES) for high-dimensional black-box optimization problems. The proposed method, CR-FM-NES, extends a recently proposed state-of-the-art NES, Fast Moving Natural Evolution Strategy (FM-NES), so that it is applicable to high-dimensional problems. CR-FM-NES builds on the idea of using a restricted representation of the covariance matrix instead of a full covariance matrix, while inheriting the efficiency of FM-NES. The restricted representation of the covariance matrix enables CR-FM-NES to update the parameters of a multivariate normal distribution in linear time and space complexity, which makes it applicable to high-dimensional problems. Our experimental results reveal that CR-FM-NES does not lose the efficiency of FM-NES; on the contrary, it achieves a significant speedup compared to FM-NES on some benchmark problems. Furthermore, our numerical experiments using 200-, 600-, and 1000-dimensional benchmark problems demonstrate that CR-FM-NES is more effective than the scalable baseline methods VD-CMA and Sep-CMA.

ML | Jan 12, 2022
Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap

Masahiro Kato, Kaito Ariu, Masaaki Imaizumi et al.

We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First, we review a lower bound derived by Kaufmann et al. (2016). Then, we propose the "Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)" strategy, which consists of the sampling rule using the Neyman allocation with an estimated standard deviation and the recommendation rule using an AIPW estimator. Our proposed strategy is optimal because the upper bound matches the lower bound when the budget goes to infinity and the gap goes to zero.
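The allocation rule named in this abstract can be sketched in a few lines. The example below shows only the Neyman allocation ratio for two arms with known standard deviations, where each arm receives a share of the budget proportional to its standard deviation. The strategy in the abstract uses estimated standard deviations and an AIPW estimator for the recommendation, neither of which is reproduced here; the function name is hypothetical.

```python
from typing import Tuple


def neyman_allocation(sigma_1: float, sigma_2: float) -> Tuple[float, float]:
    """Budget fractions for two arms, proportional to their standard deviations."""
    if sigma_1 <= 0.0 or sigma_2 <= 0.0:
        raise ValueError("standard deviations must be positive")
    total = sigma_1 + sigma_2
    return sigma_1 / total, sigma_2 / total


# An arm that is three times noisier receives three times as many samples.
share_1, share_2 = neyman_allocation(1.0, 3.0)
```

With equal standard deviations the rule reduces to uniform sampling of the two arms.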

NE | Nov 22, 2021
Towards a Principled Learning Rate Adaptation for Natural Evolution Strategies

Masahiro Nomura, Isao Ono

Natural Evolution Strategies (NES) is a promising framework for black-box continuous optimization problems. NES optimizes the parameters of a probability distribution based on the estimated natural gradient, and one of the key parameters affecting its performance is the learning rate. We argue that, from the viewpoint of the natural gradient method, the learning rate should be determined according to the estimation accuracy of the natural gradient. To do so, we propose a new learning rate adaptation mechanism for NES. The proposed mechanism makes it possible to set a high learning rate for problems that are relatively easy to optimize, which speeds up the search. On the other hand, in problems that are difficult to optimize (e.g., multimodal functions), the proposed mechanism makes it possible to set a conservative learning rate when the estimation accuracy of the natural gradient seems to be low, which results in a robust and stable search. The experimental evaluations on unimodal and multimodal functions demonstrate that the proposed mechanism works properly depending on the search situation and is more effective than the existing method, i.e., using a fixed learning rate.

NE | Aug 21, 2021
Natural Evolution Strategy for Unconstrained and Implicitly Constrained Problems with Ridge Structure

Masahiro Nomura, Isao Ono

In this paper, we propose a new natural evolution strategy for unconstrained black-box function optimization (BBFO) problems and implicitly constrained BBFO problems. BBFO problems are known to be difficult because explicit representations of objective functions are not available. Implicit constraints make the problems more difficult because whether or not a solution is feasible is revealed only when the solution is evaluated with the objective function. DX-NES-IC is one of the promising methods for implicitly constrained BBFO problems. DX-NES-IC has shown better performance than conventional methods on implicitly constrained benchmark problems. However, DX-NES-IC has the problem that the probability distribution moves slowly on ridge structure. To address the problem, we propose the Fast Moving Natural Evolution Strategy (FM-NES), which accelerates the movement of the probability distribution on ridge structure by introducing the rank-one update, which is used in CMA-ES, into DX-NES-IC. Since naively introducing the rank-one update degrades the search performance on implicitly constrained problems, we propose a condition for performing the rank-one update. We also propose to reset the shape of the probability distribution when an infeasible solution is sampled for the first time. In numerical experiments using unconstrained and implicitly constrained benchmark problems, FM-NES showed better performance than DX-NES-IC on problems with ridge structure and almost the same performance as DX-NES-IC on the others. Furthermore, FM-NES outperformed xNES, CMA-ES, xNES with the resampling technique, and CMA-ES with the resampling technique.

LG | Dec 13, 2020
Warm Starting CMA-ES for Hyperparameter Optimization

Masahiro Nomura, Shuhei Watanabe, Youhei Akimoto et al.

Hyperparameter optimization (HPO), formulated as black-box optimization (BBO), is recognized as essential for the automation and high performance of machine learning approaches. The CMA-ES is a promising BBO approach with a high degree of parallelism; it has been applied to HPO tasks, often under parallel implementation, and has shown superior performance to other approaches, including Bayesian optimization (BO). However, if the budget of hyperparameter evaluations is severely limited, which is often the case for end users who do not have access to parallel computing, the CMA-ES exhausts the budget without improving the performance due to its long adaptation phase, and is consequently outperformed by BO approaches. To address this issue, we propose to transfer prior knowledge on similar HPO tasks through the initialization of the CMA-ES, which significantly shortens the adaptation time. The knowledge transfer is designed based on a novel definition of task similarity, whose correlation with the performance of the proposed approach is confirmed on synthetic problems. The proposed warm starting CMA-ES, called WS-CMA-ES, is applied to different HPO tasks where some prior knowledge is available, showing its superior performance over the original CMA-ES as well as BO approaches with or without using the prior knowledge.

ML | Jun 24, 2020
Simple and Scalable Parallelized Bayesian Optimization

Masahiro Nomura

In recent years, leveraging parallel and distributed computational resources has become essential to solve problems of high computational cost. Bayesian optimization (BO) has shown attractive results in those expensive-to-evaluate problems such as hyperparameter optimization of machine learning algorithms. While many parallel BO methods have been developed to search efficiently utilizing these computational resources, these methods assumed synchronous settings or were not scalable. In this paper, we propose a simple and scalable BO method for asynchronous parallel settings. Experiments are carried out with a benchmark function and hyperparameter optimization of multi-layer perceptrons, which demonstrate the promising performance of the proposed method.

LG | Jun 18, 2020
Efficient Hyperparameter Optimization under Multi-Source Covariate Shift

Masahiro Nomura, Yuta Saito

A typical assumption in supervised machine learning is that the train (source) and test (target) datasets follow exactly the same distribution. This assumption is, however, often violated in uncertain real-world applications, which motivates the study of learning under covariate shift. In this setting, the naive use of adaptive hyperparameter optimization methods such as Bayesian optimization does not work as desired since it does not address the distributional shift among different datasets. In this work, we consider a novel hyperparameter optimization problem under multi-source covariate shift, whose goal is to find the optimal hyperparameters for a target task of interest using only unlabeled data in the target task and labeled data in multiple source tasks. To conduct efficient hyperparameter optimization for the target task, it is essential to estimate the target objective using only the available information. To this end, we construct a variance-reduced estimator that unbiasedly approximates the target objective with a desirable variance property. Building on the proposed estimator, we provide a general and tractable hyperparameter optimization procedure, which works favorably in our setting with a no-regret guarantee. The experiments demonstrate that the proposed framework broadens the applications of automated hyperparameter optimization.

ML | Nov 18, 2019
A Simple Heuristic for Bayesian Optimization with A Low Budget

Masahiro Nomura, Kenshi Abe

The aim of black-box optimization is to optimize an objective function within the constraints of a given evaluation budget. In this problem, it is generally assumed that the computational cost of evaluating a point is large; thus, it is important to search efficiently with as low a budget as possible. Bayesian optimization is an efficient method for black-box optimization and provides an exploration-exploitation trade-off by constructing a surrogate model that considers the uncertainty of the objective function. However, because Bayesian optimization must construct the surrogate model for the entire search space, it does not exhibit good performance when points are not sampled sufficiently. In this study, we develop a heuristic method that refines the search space for Bayesian optimization when the available evaluation budget is low. The proposed method refines a promising region by dividing the original region so that Bayesian optimization can be executed with the promising region as the initial search space. Through experiments on benchmark functions and the hyperparameter optimization of machine learning algorithms, we confirm that Bayesian optimization with the proposed method outperforms Bayesian optimization alone and performs as well as or better than two search-space division algorithms.

ML | Oct 16, 2019
Towards Resolving Propensity Contradiction in Offline Recommender Learning

Yuta Saito, Masahiro Nomura

We study offline recommender learning from explicit rating feedback in the presence of selection bias. A current promising solution for the bias is the inverse propensity score (IPS) estimation. However, the performance of existing propensity-based methods can suffer significantly from the propensity estimation bias. In fact, most of the previous IPS-based methods require some amount of missing-completely-at-random (MCAR) data to accurately estimate the propensity. This leads to a critical self-contradiction; IPS is ineffective without MCAR data, even though it originally aims to learn recommenders from only missing-not-at-random feedback. To resolve this propensity contradiction, we derive a propensity-independent generalization error bound and propose a novel algorithm to minimize the theoretical bound via adversarial learning. Our theory and algorithm do not require a propensity estimation procedure, thereby leading to a well-performing rating predictor without the true propensity information. Extensive experiments demonstrate that the proposed approach is superior to a range of existing methods both in rating prediction and ranking metrics in practical settings without MCAR data.