Heishiro Kanagawa

ML
h-index: 27
Papers: 13
Citations: 302
Novelty: 53%
AI Score: 47

13 Papers

ML · Nov 10, 2022
Controlling Moments with Kernel Stein Discrepancies

Heishiro Kanagawa, Alessandro Barp, Arthur Gretton et al.

Kernel Stein discrepancies (KSDs) measure the quality of a distributional approximation and can be computed even when the target density has an intractable normalizing constant. Notable applications include the diagnosis of approximate MCMC samplers and goodness-of-fit tests for unnormalized statistical models. The present work analyzes the convergence control properties of KSDs. We first show that standard KSDs used for weak convergence control fail to control moment convergence. To address this limitation, we next provide sufficient conditions under which alternative diffusion KSDs control both moment and weak convergence. As an immediate consequence we develop, for each $q > 0$, the first KSDs known to exactly characterize $q$-Wasserstein convergence.
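To illustrate why a KSD needs only an unnormalized target, here is a minimal sketch of a V-statistic estimate of the squared Langevin KSD with a Gaussian kernel. It is the standard Langevin form, not the diffusion KSDs proposed in the paper, and the function names, kernel choice, bandwidth, and standard normal target are all assumptions made for illustration.

```python
import numpy as np


def langevin_ksd_squared(x, score, bandwidth=1.0):
    """V-statistic estimate of the squared Langevin KSD with a Gaussian kernel.

    x         : (n, d) array of sample points
    score     : function returning grad log p at each row of x, shape (n, d);
                the normalizing constant of p never enters
    bandwidth : Gaussian kernel bandwidth h (illustrative choice)
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    s = score(x)
    diff = x[:, None, :] - x[None, :, :]      # diff[i, j] = x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)
    h2 = bandwidth ** 2
    k = np.exp(-sqdist / (2.0 * h2))          # k(x_i, x_j)

    s_i_dot_diff = np.einsum("id,ijd->ij", s, diff)   # s(x_i) . (x_i - x_j)
    s_j_dot_diff = np.einsum("jd,ijd->ij", s, diff)   # s(x_j) . (x_i - x_j)

    stein_kernel = (
        (s @ s.T) * k                          # s(x_i) . s(x_j) k
        + s_i_dot_diff / h2 * k                # s(x_i) . grad_y k
        - s_j_dot_diff / h2 * k                # s(x_j) . grad_x k
        + (d / h2 - sqdist / h2 ** 2) * k      # trace of grad_x grad_y k
    )
    return float(np.mean(stein_kernel))


def standard_normal_score(x):
    # Score of a standard normal target: grad log p(x) = -x.
    return -x


rng = np.random.default_rng(0)
on_target = rng.normal(size=(300, 1))
shifted = on_target + 2.0
print(langevin_ksd_squared(on_target, standard_normal_score))
print(langevin_ksd_squared(shifted, standard_normal_score))
```

The sample drawn from the target should give the smaller value. Only the score function is passed in, so any constant factor in the density cancels.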

ML · Oct 19, 2022
A kernel Stein test of goodness of fit for sequential models

Jerome Baum, Heishiro Kanagawa, Arthur Gretton

We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.

LG · May 27
Thinned Mean Field Langevin Dynamics

Zonghao Chen, Heishiro Kanagawa, François-Xavier Briol et al.

Several important learning tasks can be formulated as minimizing an entropy-regularized objective over an appropriate space of probability distributions. Mean-field Langevin dynamics (MFLD) facilitate computation in this general context, casting the minimizer as the invariant distribution of a McKean-Vlasov process, which can be numerically discretized using $N$ particles and thus simulated. However, simulating this interacting particle system has computational complexity of order $N^2$. Motivated by recent research into kernel thinning, we propose KT-MFLD, in which each particle interacts only with a thinned particle coreset of size $\mathcal{O}(N^{1/2})$. KT-MFLD thus reduces the computational complexity to order $N^{3/2}$ while, under mild regularity conditions, achieving the same convergence guarantees (up to logarithmic factors) as MFLD. Our theoretical analysis is empirically confirmed on tasks including the training of student-teacher neural networks, quantization with maximum mean discrepancy, and computation of predictively-oriented posteriors in a post-Bayesian framework.
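As a rough check of the stated orders, the following counts only pairwise interaction evaluations per step. Constants, logarithmic factors, and the cost of the thinning step itself are ignored, which is a simplifying assumption of this sketch rather than a statement from the paper.

```latex
\[
\underbrace{N \cdot N}_{\text{MFLD}} = N^{2},
\qquad
\underbrace{N \cdot N^{1/2}}_{\text{KT-MFLD}} = N^{3/2},
\qquad
\frac{N^{2}}{N^{3/2}} = N^{1/2}.
\]
```

For example, with $N = 10^4$ particles this is $10^8$ versus $10^6$ interaction evaluations, a factor of $100$.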

CO · May 22, 2024
Reinforcement Learning for Adaptive MCMC

Congye Wang, Wilson Chen, Heishiro Kanagawa et al.

An informal observation, made by several authors, is that the adaptive design of a Markov transition kernel has the flavour of a reinforcement learning task. Yet, to date it has remained unclear how to actually exploit modern reinforcement learning technologies for adaptive MCMC. The aim of this paper is to set out a general framework, called Reinforcement Learning Metropolis-Hastings, that is theoretically supported and empirically validated. Our principal focus is on learning fast-mixing Metropolis-Hastings transition kernels, which we cast as deterministic policies and optimise via a policy gradient. Control of the learning rate provably ensures conditions for ergodicity are satisfied. The methodology is used to construct a gradient-free sampler that outperforms a popular gradient-free adaptive Metropolis-Hastings algorithm on $\approx 90\%$ of tasks in the PosteriorDB benchmark.

CO · Jul 1, 2025
Harnessing the Power of Reinforcement Learning for Adaptive MCMC

Congye Wang, Matthew A. Fisher, Heishiro Kanagawa et al.

Sampling algorithms drive probabilistic machine learning, and recent years have seen an explosion in the diversity of tools for this task. However, the increasing sophistication of sampling algorithms is correlated with an increase in the tuning burden. There is now a greater need than ever to treat the tuning of samplers as a learning task in its own right. In a conceptual breakthrough, Wang et al. (2025) formulated Metropolis-Hastings as a Markov decision process, opening up the possibility for adaptive tuning using Reinforcement Learning (RL). Their emphasis was on theoretical foundations; realising the practical benefit of Reinforcement Learning Metropolis-Hastings (RLMH) was left for subsequent work. The purpose of this paper is twofold. First, we observe the surprising result that natural choices of reward, such as the acceptance rate or the expected squared jump distance, provide insufficient signal for training RLMH. Instead, we propose a novel reward based on the contrastive divergence, whose superior performance in the context of RLMH is demonstrated. Second, we explore the potential of RLMH and present adaptive gradient-based samplers that balance flexibility of the Markov transition kernel with learnability of the associated RL task. A comprehensive simulation study using the posteriordb benchmark supports the practical effectiveness of RLMH.
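To make the two "natural" rewards concrete, here is a minimal sketch of a plain random-walk Metropolis-Hastings sampler that reports its acceptance rate and an empirical expected squared jump distance. It is not RLMH and involves no learning. The function names, the one-dimensional unnormalized Gaussian target, and the step sizes are assumptions chosen for illustration.

```python
import numpy as np


def random_walk_mh(log_density, x0, step_size, n_steps, rng):
    """Random-walk Metropolis-Hastings on the real line.

    Returns (acceptance rate, empirical expected squared jump distance).
    log_density may be unnormalized: only differences of its values are used.
    """
    x = float(x0)
    log_p = log_density(x)
    accepted = 0
    squared_jumps = np.empty(n_steps)
    for t in range(n_steps):
        proposal = x + step_size * rng.normal()
        log_p_prop = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            squared_jumps[t] = (proposal - x) ** 2
            x, log_p = proposal, log_p_prop
            accepted += 1
        else:
            squared_jumps[t] = 0.0   # a rejected move is a jump of length zero
    return accepted / n_steps, float(np.mean(squared_jumps))


def unnormalized_log_density(x):
    # Standard normal up to an additive constant (illustrative target).
    return -0.5 * x ** 2


for step in (0.1, 2.5, 50.0):
    acc, esjd = random_walk_mh(
        unnormalized_log_density, 0.0, step, 2000, np.random.default_rng(0)
    )
    print(f"step={step}: acceptance={acc:.3f}, ESJD={esjd:.3f}")
```

Very small steps are accepted almost always but barely move the chain, while very large steps are mostly rejected, which is one reason the acceptance rate alone is a weak tuning signal.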

ML · May 27, 2025
Stationary MMD Points for Cubature

Zonghao Chen, Toni Karvonen, Heishiro Kanagawa et al.

Approximation of a target probability distribution using a finite set of points is a problem of fundamental importance, arising in cubature, data compression, and optimisation. Several authors have proposed to select points by minimising a maximum mean discrepancy (MMD), but the non-convexity of this objective precludes global minimisation in general. Instead, we consider stationary points of the MMD which, in contrast to points globally minimising the MMD, can be accurately computed. Our main theoretical contribution is the (perhaps surprising) result that, for integrands in the associated reproducing kernel Hilbert space, the cubature error of stationary MMD points vanishes faster than the MMD. Motivated by this super-convergence property, we consider discretised gradient flows as a practical strategy for computing stationary points of the MMD, presenting a refined convergence analysis that establishes a novel non-asymptotic finite-particle error bound, which may be of independent interest.
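For readers unfamiliar with the objective, the sketch below evaluates a squared MMD between a small point set and a target. It only evaluates the objective and does not compute stationary points or run a gradient flow. The Gaussian kernel, the bandwidth, the function names, and the use of a large sample as a stand-in for the target distribution are all assumptions of this illustration.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between rows of a and rows of b."""
    sqdist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sqdist / (2.0 * bandwidth ** 2))


def mmd_squared(points, target_sample, bandwidth=1.0):
    """V-statistic estimate of the squared MMD between two empirical measures."""
    kxx = gaussian_gram(points, points, bandwidth)
    kyy = gaussian_gram(target_sample, target_sample, bandwidth)
    kxy = gaussian_gram(points, target_sample, bandwidth)
    return float(kxx.mean() + kyy.mean() - 2.0 * kxy.mean())


rng = np.random.default_rng(0)
target_sample = rng.normal(size=(500, 2))   # stand-in for the target distribution
points = rng.normal(size=(50, 2))           # candidate point set
print(mmd_squared(points, target_sample))
print(mmd_squared(points + 3.0, target_sample))
```

A point set that matches the target gives a value near zero, and a shifted point set gives a larger one.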

LG · Jun 7, 2021
Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation

Liyuan Xu, Heishiro Kanagawa, Arthur Gretton

Proxy causal learning (PCL) is a method for estimating the causal effect of treatments on outcomes in the presence of unobserved confounding, using proxies (structured side information) for the confounder. This is achieved via two-stage regression: in the first stage, we model relations among the treatment and proxies; in the second stage, we use this model to learn the effect of treatment on the outcome, given the context provided by the proxies. PCL guarantees recovery of the true causal effect, subject to identifiability conditions. We propose a novel method for PCL, the deep feature proxy variable method (DFPV), to address the case where the proxies, treatments, and outcomes are high-dimensional and have nonlinear complex relationships, as represented by deep neural network features. We show that DFPV outperforms recent state-of-the-art PCL methods on challenging synthetic benchmarks, including settings involving high-dimensional image data. Furthermore, we show that PCL can be applied to off-policy evaluation for the confounded bandit problem, in which DFPV also exhibits competitive performance.

ML · Aug 23, 2020
Blindness of score-based methods to isolated components and mixing proportions

Li K. Wenliang, Heishiro Kanagawa

Statistical tasks such as density estimation and approximate Bayesian inference often involve densities with unknown normalising constants. Score-based methods, including score matching, are popular techniques as they are free of normalising constants. Although these methods enjoy theoretical guarantees, a little-known fact is that they exhibit practical failure modes when the unnormalised distribution of interest has isolated components: they cannot discover isolated components or identify the correct mixing proportions between components. We demonstrate these findings using simple distributions and present heuristic attempts to address these issues. We hope to bring the attention of theoreticians and practitioners to these issues when developing new algorithms and applications.

ML · Feb 24, 2020
Testing Goodness of Fit of Conditional Density Models with Kernels

Wittawat Jitkrittum, Heishiro Kanagawa, Bernhard Schölkopf

We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function $p(y|x)$ and a joint sample, decide whether the sample is drawn from $p(y|x)r_x(x)$ for some density $r_x$. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density model, and require no knowledge of the normalizing constant. We show that 1) our tests are consistent against any fixed alternative conditional model; 2) the statistics can be estimated easily, requiring no density estimation as an intermediate step; and 3) our second test offers an interpretable test result providing insight on where the conditional model does not fit well in the domain of the covariate. We demonstrate the interpretability of our test on a task of modeling the distribution of New York City's taxi drop-off location given a pick-up point. To our knowledge, our work is the first to propose such conditional goodness-of-fit tests that simultaneously have all these desirable properties.

ML · Feb 22, 2020
Amortised Learning by Wake-Sleep

Li K. Wenliang, Theodore Moskovitz, Heishiro Kanagawa et al.

Models that employ latent variables to capture structure in observed data lie at the heart of many current unsupervised learning algorithms, but exact maximum-likelihood learning for powerful and flexible latent-variable models is almost always intractable. Thus, state-of-the-art approaches either abandon the maximum-likelihood framework entirely, or else rely on a variety of variational approximations to the posterior distribution over the latents. Here, we propose an alternative approach that we call amortised learning. Rather than computing an approximation to the posterior over latents, we use a wake-sleep Monte-Carlo strategy to learn a function that directly estimates the maximum-likelihood parameter updates. Amortised learning is possible whenever samples of latents and observations can be simulated from the generative model, treating the model as a "black box". We demonstrate its effectiveness on a wide range of complex models, including those with latents that are discrete or supported on non-Euclidean spaces.

ML · Jul 1, 2019
A Kernel Stein Test for Comparing Latent Variable Models

Heishiro Kanagawa, Wittawat Jitkrittum, Lester Mackey et al.

We propose a kernel-based nonparametric test of relative goodness of fit, where the goal is to compare two models, both of which may have unobserved latent variables, such that the marginal distribution of the observed variables is intractable. The proposed test generalizes the recently proposed kernel Stein discrepancy (KSD) tests (Liu et al., 2016; Chwialkowski et al., 2016; Yang et al., 2018) to the case of latent variable models, a much more general class than the fully observed models treated previously. The new test, with a properly calibrated threshold, has a well-controlled type-I error. In the case of certain models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative Maximum Mean Discrepancy test, which is based on samples from the models and does not exploit the latent structure.

ML · Oct 27, 2018
Informative Features for Model Comparison

Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn Sangkloy et al.

Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.

LG · Mar 8, 2018
Cross-domain Recommendation via Deep Domain Adaptation

Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu et al.

The behavior of users in certain services can be used to infer their preferences and to make recommendations for other services they have never used. However, the cross-domain relationships between items and user consumption patterns are not simple, especially when there are few or no common users and items across domains. To address this problem, we propose a content-based cross-domain recommendation method for cold-start users that does not require user or item overlap. We formulate recommendation as extreme multi-class classification where labels (items) corresponding to the users are predicted. With this formulation, the problem is reduced to a domain adaptation setting, in which a classifier trained in the source domain is adapted to the target domain. For this, we construct a neural network that combines an architecture for domain adaptation, Domain Separation Network, with a denoising autoencoder for item representation. We assess the performance of our approach in experiments on a pair of data sets collected from movie and news services of Yahoo! JAPAN and show that our approach outperforms several baseline methods including a cross-domain collaborative filtering method.