Takeshi Teshima

LG
9 papers
402 citations
Novelty 51%
AI Score 26

9 Papers

LG · Apr 15, 2022
Universal approximation property of invertible neural networks

Isao Ishikawa, Takeshi Teshima, Koichi Tojo et al.

Invertible neural networks (INNs) are neural network architectures with invertibility by design. Thanks to their invertibility and the tractability of their Jacobians, INNs have various machine learning applications such as probabilistic modeling, generative modeling, and representation learning. However, these attractive properties often come at the cost of restricted layer designs, which raises a question about their representation power: can we use these models to approximate sufficiently diverse functions? To answer this question, we have developed a general theoretical framework to investigate the representation power of INNs, building on a structure theorem from differential geometry. The framework simplifies the approximation problem for diffeomorphisms, which enables us to show the universal approximation properties of INNs. We apply the framework to two representative classes of INNs, namely Coupling-Flow-based INNs (CF-INNs) and Neural Ordinary Differential Equations (NODEs), and elucidate their high representation power despite the restrictions on their architectures.

LG · Dec 19, 2021
Rethinking Importance Weighting for Transfer Learning

Nan Lu, Tianyi Zhang, Tongtong Fang et al.

A key assumption in supervised learning is that training and test data follow the same probability distribution. However, this fundamental assumption is not always satisfied in practice, e.g., due to changing environments, sample selection bias, privacy concerns, or high labeling costs. Transfer learning (TL) relaxes this assumption and allows us to learn under distribution shift. Classical TL methods typically rely on importance weighting: a predictor is trained on training losses weighted according to the importance (i.e., the test-over-training density ratio). However, as real-world machine learning tasks become increasingly complex, high-dimensional, and dynamic, novel approaches have recently been explored to cope with these challenges. In this article, after introducing the foundation of TL based on importance weighting, we review recent advances based on joint and dynamic importance-predictor estimation. Furthermore, we introduce a method of causal mechanism transfer that incorporates causal structure in TL. Finally, we discuss future perspectives of TL research.

LG · Feb 27, 2021
Incorporating Causal Graphical Prior Knowledge into Predictive Modeling via Simple Data Augmentation

Takeshi Teshima, Masashi Sugiyama

Causal graphs (CGs) are compact representations of the knowledge of the data generating processes behind the data distributions. When a CG is available, e.g., from domain knowledge, we can infer the conditional independence (CI) relations that should hold in the data distribution. However, it is not obvious how to incorporate this knowledge into predictive modeling. In this work, we propose a model-agnostic data augmentation method that allows us to exploit the prior knowledge of the CI encoded in a CG for supervised machine learning. We theoretically justify the proposed method by providing an excess risk bound indicating that the proposed method suppresses overfitting by reducing the apparent complexity of the predictor hypothesis class. Using real-world data with CGs provided by domain experts, we experimentally show that the proposed method is effective in improving the prediction accuracy, especially in the small-data regime.

LG · Dec 4, 2020
Universal Approximation Property of Neural Ordinary Differential Equations

Takeshi Teshima, Koichi Tojo, Masahiro Ikeda et al.

Neural ordinary differential equations (NODEs) are an invertible neural network architecture that is promising for its free-form Jacobian and the availability of a tractable Jacobian determinant estimator. Recently, the representation power of NODEs has been partly uncovered: they form an $L^p$-universal approximator for continuous maps under certain conditions. However, $L^p$-universality may fail to guarantee an approximation over the entire input domain, as it may still hold even if the approximator differs largely from the target function on a small region of the input space. To further uncover the potential of NODEs, we show a stronger approximation property, namely $\sup$-universality for approximating a large class of diffeomorphisms. This is shown by leveraging a structure theorem of the diffeomorphism group, and the result complements the existing literature by establishing a fairly large set of mappings that NODEs can approximate with a stronger guarantee.
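The gap between $L^p$ and $\sup$ guarantees mentioned in this abstract can be seen in a small numerical sketch. The example below is a toy construction, not taken from the paper: the target is the identity on $[0, 1]$, and the hypothetical approximator is off by 1 on a short interval of length `width`. The helper names are assumptions made for illustration.

```python
def target(x):
    """Target function on [0, 1] (toy choice: the identity)."""
    return x


def make_bumped_approximator(width):
    """Approximator that matches the target except on [0, width), where it is off by 1."""
    def approximator(x):
        return x + 1.0 if x < width else x
    return approximator


def lp_error(f, g, p, n=10000):
    """L^p distance on [0, 1], approximated by a midpoint Riemann sum with n cells."""
    h = 1.0 / n
    total = sum(abs(f((i + 0.5) * h) - g((i + 0.5) * h)) ** p for i in range(n)) * h
    return total ** (1.0 / p)


def sup_error(f, g, n=10000):
    """Sup distance on [0, 1], approximated by the maximum over the same grid."""
    h = 1.0 / n
    return max(abs(f((i + 0.5) * h) - g((i + 0.5) * h)) for i in range(n))


for width in (0.1, 0.01, 0.001):
    g = make_bumped_approximator(width)
    print(width, lp_error(target, g, p=2), sup_error(target, g))
```

In this toy case the $L^p$ error equals `width ** (1 / p)` and shrinks as the bad region narrows, while the sup error stays at 1. This is the kind of failure that a $\sup$-universality result rules out.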

LG · Jun 20, 2020
Coupling-based Invertible Neural Networks Are Universal Diffeomorphism Approximators

Takeshi Teshima, Isao Ishikawa, Koichi Tojo et al.

Invertible neural networks based on coupling flows (CF-INNs) have various machine learning applications such as image synthesis and representation learning. However, their desirable characteristics, such as analytic invertibility, come at the cost of restricting the functional forms. This raises a question about their representation power: are CF-INNs universal approximators for invertible functions? Without universality, there could be a well-behaved invertible transformation that a CF-INN can never approximate, which would render the model class unreliable. We answer this question by showing a convenient criterion: a CF-INN is universal if its layers contain affine coupling and invertible linear functions as special cases. As a corollary, we can affirmatively resolve a previously unsolved problem: whether normalizing flow models based on affine coupling can be universal distributional approximators. In the course of proving the universality, we prove a general theorem showing the equivalence of universality for certain diffeomorphism classes, a theoretical insight that is of independent interest.
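For readers unfamiliar with affine coupling, the following is a minimal sketch of a single affine coupling layer and its analytic inverse. It follows the standard construction (the first block passes through unchanged; the second block is scaled and shifted by functions of the first). The functions `toy_scale` and `toy_shift` are hypothetical stand-ins for learned networks and are not from the paper.

```python
import math


def toy_scale(x1, out_dim):
    """Hypothetical stand-in for a learned log-scale network s(x1)."""
    return [math.tanh(sum(x1) + j) for j in range(out_dim)]


def toy_shift(x1, out_dim):
    """Hypothetical stand-in for a learned shift network t(x1)."""
    return [math.sin(sum(x1)) * (j + 1) for j in range(out_dim)]


def coupling_forward(x, split):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); also returns log|det Jacobian|."""
    x1, x2 = x[:split], x[split:]
    s = toy_scale(x1, len(x2))
    t = toy_shift(x1, len(x2))
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)  # the Jacobian is triangular with diagonal (1, ..., 1, exp(s))
    return x1 + y2, log_det


def coupling_inverse(y, split):
    """Analytic inverse: x1 = y1, x2 = (y2 - t(y1)) * exp(-s(y1))."""
    y1, y2 = y[:split], y[split:]
    s = toy_scale(y1, len(y2))
    t = toy_shift(y1, len(y2))
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


x = [0.3, -1.2, 2.0, 0.5]
y, log_det = coupling_forward(x, split=2)
x_rec = coupling_inverse(y, split=2)
print(y, log_det)
print(x_rec)
```

Because `s` and `t` depend only on the untouched block, the inverse never needs to invert them, which is why they can be arbitrary functions. A single layer leaves half of the coordinates fixed, so practical CF-INNs interleave such layers with permutations or invertible linear maps.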

ML · Jun 13, 2020
$γ$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator

Masahiro Fujisawa, Takeshi Teshima, Issei Sato et al.

Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based $γ$-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable theoretical robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear-time computational complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures.

LG · Jun 12, 2020
Non-Negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation

Masahiro Kato, Takeshi Teshima

Density ratio estimation (DRE) is at the core of various machine learning tasks such as anomaly detection and domain adaptation. In existing studies on DRE, methods based on Bregman divergence (BD) minimization have been extensively studied. However, when applied to highly flexible models such as deep neural networks, BD minimization tends to suffer from what we call train-loss hacking, a source of overfitting caused by a typical characteristic of empirical BD estimators. In this paper, to mitigate train-loss hacking, we propose a non-negative correction for empirical BD estimators. Theoretically, we confirm the soundness of the proposed method through a generalization error bound. In our experiments, the proposed methods perform favorably in inlier-based outlier detection.

LG · Feb 10, 2020
Few-shot Domain Adaptation by Causal Mechanism Transfer

Takeshi Teshima, Issei Sato, Masashi Sugiyama

We study few-shot supervised domain adaptation (DA) for regression problems, where only a few labeled target-domain samples and many labeled source-domain samples are available. Many current DA methods base their transfer assumptions on either parametrized distribution shift or apparent distribution similarities, e.g., identical conditionals or small distributional discrepancies. However, these assumptions may preclude the possibility of adaptation from intricately shifted and apparently very different distributions. To overcome this problem, we propose mechanism transfer, a meta-distributional scenario in which a data generating mechanism is invariant among domains. This transfer assumption can accommodate nonparametric shifts resulting in apparently different distributions while providing a solid statistical basis for DA. We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. Our method can be seen as the first attempt to fully leverage structural causal models for DA.

LG · Sep 13, 2018
Clipped Matrix Completion: A Remedy for Ceiling Effects

Takeshi Teshima, Miao Xu, Issei Sato et al.

We consider the problem of recovering a low-rank matrix from its clipped observations. Clipping occurs in many scientific areas and obstructs statistical analyses. On the other hand, matrix completion (MC) methods can recover a low-rank matrix from various information deficits by using the principle of low-rank completion. However, the current theoretical guarantees for low-rank MC do not apply to clipped matrices, as the deficit depends on the underlying values. Therefore, the feasibility of clipped matrix completion (CMC) is not trivial. In this paper, we first provide a theoretical guarantee for the exact recovery of CMC by using a trace-norm minimization algorithm. Furthermore, we propose practical CMC algorithms by extending ordinary MC methods. Our extension uses the squared hinge loss in place of the squared loss to reduce the penalty for over-estimation on clipped entries. We also propose a novel regularization term tailored for CMC. It is a combination of two trace-norm terms, and we theoretically bound the recovery error under the regularization. We demonstrate the effectiveness of the proposed methods through experiments using both synthetic and benchmark data for recommendation systems.
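The loss modification described in this abstract can be sketched as follows. This is only one plausible reading of the abstract, not the paper's exact objective: observed entries below the clipping threshold get the usual squared loss, while entries observed at the threshold get a squared hinge loss that penalizes only under-estimation. The function name `cmc_data_loss`, the argument layout, and the omission of any regularizer are assumptions made for illustration.

```python
def cmc_data_loss(pred, observed, mask, clip_value):
    """Data-fit term for clipped matrix completion (illustrative sketch).

    pred, observed: matrices as lists of rows.
    mask[i][j]: True when entry (i, j) was observed.
    Entries observed at clip_value are treated as clipped.
    """
    total = 0.0
    for i, row in enumerate(observed):
        for j, value in enumerate(row):
            if not mask[i][j]:
                continue  # unobserved entries contribute nothing
            if value >= clip_value:
                # Squared hinge: no penalty when the prediction exceeds the threshold,
                # since the true value may lie anywhere at or above it.
                total += max(0.0, clip_value - pred[i][j]) ** 2
            else:
                # Ordinary squared loss on unclipped entries.
                total += (value - pred[i][j]) ** 2
    return total


pred = [[6.0, 2.0], [4.0, 9.0]]
observed = [[5.0, 3.0], [5.0, 0.0]]
mask = [[True, True], [True, False]]
print(cmc_data_loss(pred, observed, mask, clip_value=5.0))
```

In this toy example the prediction 6.0 at a clipped entry incurs no penalty, whereas a plain squared loss would charge (5 - 6)^2 for it. The trace-norm regularization terms mentioned in the abstract are not modeled here.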