Mahito Sugiyama

ML
h-index: 11
30 papers
181 citations
Novelty: 51%
AI Score: 49

30 Papers

ML · Sep 30, 2022
Many-body Approximation for Non-negative Tensors

Kazu Ghalamkari, Mahito Sugiyama, Yoshinobu Kawahara

We present an alternative approach to decompose non-negative tensors, called many-body approximation. Traditional decomposition methods assume low-rankness in the representation, resulting in difficulties in global optimization and target rank selection. We avoid these problems by energy-based modeling of tensors, where a tensor and its mode correspond to a probability distribution and a random variable, respectively. Our model can be globally optimized in terms of the KL divergence minimization by taking into account the interactions between variables (that is, modes), which can be tuned more intuitively than ranks. Furthermore, we visualize interactions between modes as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation. We demonstrate the effectiveness of our approach in tensor completion and approximation.

21.6 · LG · Apr 7 · Code
Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings

Laurits Fredsgaard, Aaron Thomas, Michael Riis Andersen et al.

Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admits multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC $=0.43$), while LU achieves AUC $=0.85$, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization-uncertainty
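The abstract defines Linearization Uncertainty as the coefficient of variation of the NLL across equivalent linearizations of one graph. The following is a minimal sketch of that statistic only. The function name `linearization_uncertainty`, the toy NLL values, and the use of the population standard deviation are assumptions made for illustration; the abstract does not say which estimator the authors use.

```python
import statistics


def linearization_uncertainty(nlls):
    """Coefficient of variation (std / |mean|) of NLLs for one graph.

    `nlls` holds the negative log-likelihoods a model assigns to several
    equivalent linearizations of the same graph. The population standard
    deviation is an assumption; a sample estimator is equally plausible.
    """
    if len(nlls) < 2:
        raise ValueError("need at least two linearizations")
    mean = statistics.fmean(nlls)
    if mean == 0:
        raise ValueError("mean NLL is zero; coefficient of variation undefined")
    return statistics.pstdev(nlls) / abs(mean)


# Hypothetical NLLs: a permutation-consistent model vs. an order-sensitive one.
consistent = [12.0, 12.0, 12.0, 12.0]
order_sensitive = [8.0, 12.0, 16.0, 12.0]

lu_consistent = linearization_uncertainty(consistent)
lu_sensitive = linearization_uncertainty(order_sensitive)
```

A model whose likelihood depends only on the graph gives a value of zero; larger values indicate that the assigned likelihood depends on the chosen ordering.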

BM · Dec 11, 2022
Molecular Graph Generation by Decomposition and Reassembling

Masatsugu Yamada, Mahito Sugiyama

Designing molecular structures with desired chemical properties is an essential task in drug discovery and material design. However, finding molecules with optimized properties is still challenging due to the combinatorial explosion of the candidate space of molecules. Here we propose a novel decomposition-and-reassembling based approach, which does not include any optimization in hidden space and whose generation process is highly interpretable. Our method is a two-step procedure: In the first decomposition step, we apply frequent subgraph mining to a molecular database to collect small subgraphs as building blocks of molecules. In the second reassembling step, we search for desirable building blocks guided by reinforcement learning and combine them to generate new molecules. Our experiments show that our method not only finds better molecules in terms of two standard criteria, the penalized $\log P$ and drug-likeness, but also generates drug molecules while showing valid intermediate molecules.

LG · May 25, 2022
Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel

Ryuichi Kanoh, Mahito Sugiyama

A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.

14.2 · LG · May 7
When Graph Language Models Go Beyond Memorization

Masatsugu Yamada, Mahito Sugiyama

It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond memorization at scale, primarily in the high-frequency regime. This is supported by the following empirical evidence: On five TU benchmarks, LLaMA-style graph language models reach high subgraph-rank correlation, yet their alignment is matched or exceeded by the memorization bootstrap in most cases. At small scale, under our bootstrap diagnostic, fidelity is largely indistinguishable from verbatim recall. In contrast, at large scale with 3.75M graphs, verbatim memorization drops sharply while rank correlation remains near ceiling. Crucially, in a separate fixed-subsample analysis, frequent subgraph mining restricted to the novel-only subset closely tracks the corresponding all-generation Spearman correlation, providing evidence that the alignment is not driven solely by verbatim recall. Across all scales, high-frequency patterns are well reproduced, while rare patterns remain poorly covered, and this deficit narrows only marginally as capacity increases. We observe the same scale-dependent crossover under two distinct graph serializations (canonical DFS code and action sequences), providing evidence of robustness in our analysis.

LG · Feb 29, 2024
StiefelGen: A Simple, Model Agnostic Approach for Time Series Data Augmentation over Riemannian Manifolds

Prasad Cheema, Mahito Sugiyama

Data augmentation is an area of research which has seen active development in many machine learning fields, such as in image-based learning models, reinforcement learning for self-driving vehicles, and general noise injection for point cloud data. However, convincing methods for general time series data augmentation still leave much to be desired, especially since the methods developed for these models do not readily cross over. Three common approaches for time series data augmentation include: (i) Constructing a physics-based model and then imbuing uncertainty over the coefficient space (for example), (ii) Adding noise to the observed data set(s), and (iii) Having access to ample amounts of time series data sets from which a robust generative neural network model can be trained. However, for many practical problems that work with time series data in industry: (i) One usually does not have access to a robust physical model, (ii) The addition of noise can in and of itself require large or difficult assumptions (for example, what probability distribution should be used? Or, how large should the noise variance be?), and (iii) In practice, it can be difficult to source a large representative time series database with which to train the neural network model for the underlying problem. In this paper, we propose a methodology which attempts to simultaneously tackle all three of these previous limitations to a large extent. The method relies upon the well-studied matrix differential geometry of the Stiefel manifold, as it proposes a simple way in which time series signals can be placed on, and then smoothly perturbed over, the manifold. We attempt to clarify how this method works by showcasing several potential use cases which in particular work to take advantage of the unique properties of this underlying manifold.

LG · Oct 15, 2024
A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection

James Enouen, Mahito Sugiyama

The log-linear model has received a significant amount of theoretical attention in previous decades and remains the fundamental tool used for learning probability distributions over discrete variables. Despite its large popularity in statistical mechanics and high-dimensional statistics, the vast majority of such energy-based modeling approaches only focus on the two-variable relationships, such as Boltzmann machines and Markov graphical models. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure which exists in the higher-order interactions between different variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error. This then motivates the formulation of a sparse selection problem over the set of possible mode interactions. In the same way as sparse graph selection allows for better generalization, we find that our learned distributions are able to more efficiently use the finite amount of data which is available in practice. On both synthetic and real-world datasets, we demonstrate our algorithm's effectiveness in maximizing the log-likelihood for the generative task and also the ease of adaptability to the discriminative task of classification.

LG · May 20, 2025
New Evidence of the Two-Phase Learning Dynamics of Neural Networks

Zhanpeng Zhou, Yongyi Yang, Mahito Sugiyama et al.

Understanding how deep neural networks learn remains a fundamental challenge in modern machine learning. A growing body of evidence suggests that training dynamics undergo a distinct phase transition, yet our understanding of this transition is still incomplete. In this paper, we introduce an interval-wise perspective that compares network states across a time window, revealing two new phenomena that illuminate the two-phase nature of deep learning. i) The Chaos Effect. By injecting an imperceptibly small parameter perturbation at various stages, we show that the response of the network to the perturbation exhibits a transition from chaotic to stable, suggesting there is an early critical period where the network is highly sensitive to initial conditions; ii) The Cone Effect. Tracking the evolution of the empirical Neural Tangent Kernel (eNTK), we find that after this transition point the model's functional trajectory is confined to a narrow cone-shaped subset: while the kernel continues to change, it gets trapped into a tight angular region. Together, these effects provide a structural, dynamical view of how deep networks transition from sensitive exploration to stable refinement during training.

SE · Apr 11, 2025
Bringing Structure to Naturalness: On the Naturalness of ASTs

Profir-Petru Pârţachi, Mahito Sugiyama

Source code comes in different shapes and forms. Previous research has already shown code to be more predictable than natural language as well as highlighted its statistical predictability at the token level: source code can be natural. More recently, the structure of code -- control flow, syntax graphs, abstract syntax trees etc. -- has been successfully used to improve the state-of-the-art on numerous tasks: code suggestion, code summarisation, method naming etc. This body of work implicitly assumes that structured representations of code are similarly statistically predictable, i.e. that a structured view of code is also natural. We consider that this view should be made explicit and propose directly studying the Structured Naturalness Hypothesis. Beyond just naming existing research that assumes this hypothesis and formulating it, we also provide evidence in the case of trees: TreeLSTM models over ASTs for some languages, such as Ruby, are competitive with $n$-gram models while handling the syntax token issue highlighted by previous research 'for free'. For other languages, such as Java or Python, we find tree models to perform worse, suggesting that downstream task improvement is uncorrelated to the language modelling task. Further, we show how such naturalness signals can be employed for near state-of-the-art results on just-in-time defect prediction while forgoing manual feature engineering work.

LG · Mar 20, 2025
On the Cone Effect in the Learning Dynamics

Zhanpeng Zhou, Yongyi Yang, Jie Ren et al.

Understanding the learning dynamics of neural networks is a central topic in the deep learning community. In this paper, we take an empirical perspective to study the learning dynamics of neural networks in real-world settings. Specifically, we investigate the evolution process of the empirical Neural Tangent Kernel (eNTK) during training. Our key findings reveal a two-phase learning process: i) in Phase I, the eNTK evolves significantly, signaling the rich regime, and ii) in Phase II, the eNTK keeps evolving but is constrained in a narrow space, a phenomenon we term the cone effect. This two-phase framework builds on the hypothesis proposed by Fort et al. (2020), but we uniquely identify the cone effect in Phase II, demonstrating its significant performance advantages over fully linearized training.

LG · May 23, 2024
Linear Mode Connectivity in Differentiable Tree Ensembles

Ryuichi Kanoh, Mahito Sugiyama

Linear Mode Connectivity (LMC) refers to the phenomenon that performance remains consistent for linearly interpolated models in the parameter space. For independently optimized model pairs from different random initializations, achieving LMC is considered crucial for understanding the stable success of the non-convex optimization in modern machine learning models and for facilitating practical parameter-based operations such as model merging. While LMC has been achieved for neural networks by considering the permutation invariance of neurons in each hidden layer, its attainment for other models remains an open question. In this paper, we first achieve LMC for soft tree ensembles, which are tree-based differentiable models extensively used in practice. We show the necessity of incorporating two invariances: subtree flip invariance and splitting order invariance, which do not exist in neural networks but are inherent to tree architectures, in addition to permutation invariance of trees. Moreover, we demonstrate that it is even possible to exclude such additional invariances while keeping LMC by designing decision list-based tree architectures, where such invariances do not exist by definition. Our findings indicate the significance of accounting for architecture-specific invariances in achieving LMC.
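The definition of LMC in the first sentence can be made concrete by measuring the loss barrier along the straight line between two parameter vectors. The sketch below does only that measurement on toy loss functions. It does not perform the invariance matching (tree permutation, subtree flip, splitting order) that the paper argues is needed before interpolating, and `loss_barrier`, `double_well`, and `quadratic` are hypothetical names chosen for this illustration.

```python
import numpy as np


def interpolate(theta_a, theta_b, t):
    """Point on the line segment between two parameter vectors."""
    return (1.0 - t) * theta_a + t * theta_b


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of the interpolated loss over the linear
    interpolation of the two endpoint losses (0 means no barrier)."""
    end_a, end_b = loss_fn(theta_a), loss_fn(theta_b)
    gaps = []
    for t in np.linspace(0.0, 1.0, num_points):
        baseline = (1.0 - t) * end_a + t * end_b
        gaps.append(loss_fn(interpolate(theta_a, theta_b, t)) - baseline)
    return max(gaps)


def double_well(theta):
    """Toy non-convex loss with minima at +1 and -1 in every coordinate."""
    return float(np.sum((theta ** 2 - 1.0) ** 2))


def quadratic(theta):
    """Toy convex loss; any two points are linearly connected."""
    return float(np.sum(theta ** 2))


# Two distinct minima of the double well are separated by a barrier.
barrier_nonconvex = loss_barrier(
    double_well, np.array([1.0, 1.0]), np.array([-1.0, -1.0])
)
# A convex loss never exceeds the chord between its endpoints.
barrier_convex = loss_barrier(
    quadratic, np.array([1.0, 0.0]), np.array([0.0, 1.0])
)
```

In practice the loss would be evaluated on held-out data for two trained models, after aligning one model to the other under whichever invariances the architecture has.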

CG · Mar 5
Quadratic polarity and polar Fenchel-Young divergences from the canonical Legendre polarity

Frank Nielsen, Basile Plus-Gourdon, Mahito Sugiyama

Polarity is a fundamental reciprocal duality of $n$-dimensional projective geometry which associates to points polar hyperplanes, and more generally $k$-dimensional convex bodies to polar $(n-1-k)$-dimensional convex bodies. It is well-known that the Legendre-Fenchel transformation of functions can be interpreted from the polarity viewpoint of their graphs using an extra dimension. In this paper, we first show that generic polarities induced by quadratic polarity functionals can be expressed either as deformed Legendre polarity or as the Legendre polarity of deformed convex bodies, and be efficiently manipulated using linear algebra on $(n+2)\times (n+2)$ matrices operating on homogeneous coordinates. Second, we define polar divergences using the Legendre polarity and show that they generalize the Fenchel-Young divergence or equivalent Bregman divergence. This polarity study brings new understanding of the core reference duality in information geometry. Last, we show that the total Bregman divergences can be considered as a total polar Fenchel-Young divergence from which we newly exhibit the reference duality using dual polar conformal factors.

ML · Oct 25, 2021
Fast Rank-1 NMF for Missing Data with KL Divergence

Kazu Ghalamkari, Mahito Sugiyama

We propose a fast non-gradient-based method of rank-1 non-negative matrix factorization (NMF) for missing data, called A1GM, that minimizes the KL divergence from an input matrix to the reconstructed rank-1 matrix. Our method is based on our new finding of an analytical closed-formula of the best rank-1 non-negative multiple matrix factorization (NMMF), a variety of NMF. NMMF is known to exactly solve NMF for missing data if positions of missing values satisfy a certain condition, and A1GM transforms a given matrix so that the analytical solution to NMMF can be applied. We empirically show that A1GM is more efficient than a gradient method with competitive reconstruction errors.

LG · Sep 10, 2021
A Neural Tangent Kernel Perspective of Infinite Tree Ensembles

Ryuichi Kanoh, Mahito Sugiyama

In practical situations, the tree ensemble is one of the most popular models along with neural networks. A soft tree is a variant of a decision tree. Instead of using a greedy method for searching splitting rules, the soft tree is trained using a gradient method in which the entire splitting operation is formulated in a differentiable form. Although ensembles of such soft trees have been used increasingly in recent years, little theoretical work has been done to understand their behavior. By considering an ensemble of infinite soft trees, this paper introduces and studies the Tree Neural Tangent Kernel (TNTK), which provides new insights into the behavior of the infinite ensemble of soft trees. Using the TNTK, we theoretically identify several non-trivial properties, such as global convergence of the training, the equivalence of the oblivious tree structure, and the degeneracy of the TNTK induced by the deepening of the trees.

LG · Mar 5, 2021
Unintended Effects on Adaptive Learning Rate for Training Neural Network with Output Scale Change

Ryuichi Kanoh, Mahito Sugiyama

A multiplicative constant scaling factor is often applied to the model output to adjust the dynamics of neural network parameters. This has been used as one of the key interventions in an empirical study of lazy and active behavior. However, we show that the combination of such scaling and a commonly used adaptive learning rate optimizer strongly affects the training behavior of the neural network. This is problematic as it can cause unintended behavior of neural networks, resulting in the misinterpretation of experimental results. Specifically, for some scaling settings, the effect of the adaptive learning rate disappears or is strongly influenced by the scaling factor. To avoid the unintended effect, we present a modification of an optimization algorithm and demonstrate remarkable differences between adaptive learning rate optimization and simple gradient descent, especially with a small ($<1.0$) scaling factor.

ML · Mar 4, 2021
Fast Tucker Rank Reduction for Non-Negative Tensors Using Mean-Field Approximation

Kazu Ghalamkari, Mahito Sugiyama

We present an efficient low-rank approximation algorithm for non-negative tensors. The algorithm is derived from our two findings: First, we show that rank-1 approximation for tensors can be viewed as a mean-field approximation by treating each tensor as a probability distribution. Second, we theoretically provide a sufficient condition for distribution parameters to reduce Tucker ranks of tensors; interestingly, this sufficient condition can be achieved by iterative application of the mean-field approximation. Since the mean-field approximation is always given as a closed formula, our findings lead to a fast low-rank approximation algorithm without using a gradient method. We empirically demonstrate that our algorithm is faster than the existing non-negative Tucker rank reduction methods and achieves competitive or better approximation of given tensors.
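The first finding, that rank-1 approximation can be viewed as a mean-field approximation, rests on a standard fact: among product distributions, the product of the marginals minimizes the KL divergence from a given joint distribution. The sketch below illustrates only this rank-1 step in closed form. It is not the paper's Tucker rank reduction algorithm, and `rank1_mean_field` and `kl_divergence` are names assumed for this example.

```python
import numpy as np


def kl_divergence(p, q):
    """KL divergence between two normalized non-negative arrays."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def rank1_mean_field(tensor):
    """Closed-form rank-1 approximation of a non-negative tensor.

    The tensor is normalized to a probability distribution, each mode's
    marginal is computed, and the outer product of the marginals is
    rescaled to the original total sum.
    """
    tensor = np.asarray(tensor, dtype=float)
    total = tensor.sum()
    p = tensor / total
    factors = []
    for mode in range(p.ndim):
        other_axes = tuple(ax for ax in range(p.ndim) if ax != mode)
        factors.append(p.sum(axis=other_axes))
    rank1 = factors[0]
    for factor in factors[1:]:
        rank1 = np.multiply.outer(rank1, factor)
    return total * rank1, factors


rng = np.random.default_rng(0)
example = rng.random((3, 4, 2)) + 0.1  # strictly positive toy tensor
approximation, marginals = rank1_mean_field(example)
```

No gradient steps are involved: the cost is one pass over the tensor per mode, which is the property the abstract relies on for speed.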

ML · Jun 16, 2020
Additive Poisson Process: Learning Intensity of Higher-Order Interaction in Stochastic Processes

Simon Luo, Feng Zhou, Lamiae Azizi et al.

We present the Additive Poisson Process (APP), a novel framework that can model the higher-order interaction effects of the intensity functions in stochastic processes using lower dimensional projections. Our model combines the techniques in information geometry to model higher-order interactions on a statistical manifold and in generalized additive models to use lower-dimensional projections to overcome the effects from the curse of dimensionality. Our approach solves a convex optimization problem by minimizing the KL divergence from a sample distribution in lower dimensional projections to the distribution modeled by an intensity function in the stochastic process. Our empirical results show that our model is able to use samples observed in the lower dimensional space to estimate the higher-order intensity function with extremely sparse observations.

ML · Jun 9, 2020
Fast Rank Reduction for Non-negative Matrices via Mean Field Theory

Kazu Ghalamkari, Mahito Sugiyama

We propose an efficient matrix rank reduction method for non-negative matrices, whose time complexity is quadratic in the number of rows or columns of a matrix. Our key insight is to formulate rank reduction as a mean-field approximation by modeling matrices via a log-linear model on structured sample space, which allows us to solve the rank reduction as convex optimization. The highlight of this formulation is that the optimal solution that minimizes the KL divergence from a given matrix can be analytically computed in a closed form. We empirically show that our rank reduction method is faster than NMF and its popular variant, lraNMF, while achieving competitive low rank approximation error on synthetic and real-world datasets.

ML · Jun 8, 2020
Double Descent Risk and Volume Saturation Effects: A Geometric Perspective

Prasad Cheema, Mahito Sugiyama

The appearance of the double-descent risk phenomenon has received growing interest in the machine learning and statistics community, as it challenges well-understood notions behind the U-shaped train-test curves. Motivated by Rissanen's minimum description length (MDL), Balasubramanian's Occam's Razor, and Amari's information geometry, we investigate how the logarithm of the model volume, $\log V$, extends the intuition behind the AIC and BIC model selection criteria. We find that for the particular model classes of isotropic linear regression and statistical lattices, the $\log V$ term may be decomposed into a sum of distinct components, each of which helps explain the appearance of this phenomenon. In particular, they suggest why generalization error does not necessarily continue to grow with increasing model dimensionality.

ML · Sep 25, 2019
Hierarchical Probabilistic Model for Blind Source Separation via Legendre Transformation

Simon Luo, Lamiae Azizi, Mahito Sugiyama

We present a novel blind source separation (BSS) method, called information geometric blind source separation (IGBSS). Our formulation is based on the log-linear model equipped with a hierarchically structured sample space, which has theoretical guarantees to uniquely recover a set of source signals by minimizing the KL divergence from a set of mixed signals. Source signals, received signals, and mixing matrices are realized as different layers in our hierarchical sample space. Our empirical results on images and time series data demonstrate that our approach is superior to well-established techniques and is able to separate signals with complex interactions.

ML · Jun 28, 2019
Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Simon Luo, Mahito Sugiyama

Hierarchical probabilistic models are able to use a large number of parameters to create a model with a high representation power. However, it is well known that increasing the number of parameters also increases the complexity of the model, which leads to a bias-variance trade-off. Although it is a classical problem, the bias-variance trade-off between hidden layers and higher-order interactions has not been well studied. In our study, we propose an efficient inference algorithm for the log-linear formulation of the higher-order Boltzmann machine using a combination of Gibbs sampling and annealed importance sampling. We then perform a bias-variance decomposition to study the differences between hidden layers and higher-order interactions. Our results show that using hidden layers and using higher-order interactions yield comparable errors of a similar order of magnitude, and that higher-order interactions produce less variance for smaller sample sizes.

LG · Dec 8, 2018
Learning Graph Representation via Formal Concept Analysis

Yuka Yoneda, Mahito Sugiyama, Takashi Washio

We present a novel method that can learn a graph representation from multivariate data. In our representation, each node represents a cluster of data points and each edge represents the subset-superset relationship between clusters, which can be mutually overlapped. The key to our method is to use formal concept analysis (FCA), which can extract hierarchical relationships between clusters based on the algebraic closedness property. We empirically show that our method can effectively extract hierarchical structures of clusters compared to the baseline method.

ML · May 21, 2018
Transductive Boltzmann Machines

Mahito Sugiyama, Koji Tsuda, Hiroyuki Nakahara

We present transductive Boltzmann machines (TBMs), which are the first to achieve transductive learning of the Gibbs distribution. While exact learning of the Gibbs distribution is impossible with the family of existing Boltzmann machines due to the combinatorial explosion of the sample space, TBMs overcome the problem by adaptively constructing the minimum required sample space from data to avoid unnecessary generalization. We theoretically provide a bias-variance decomposition of the KL divergence in TBMs to analyze their learnability, and empirically demonstrate that TBMs are superior to fully visible Boltzmann machines and the widely used restricted Boltzmann machines in terms of efficiency and effectiveness.

ML · Feb 13, 2018
Legendre Decomposition for Tensors

Mahito Sugiyama, Hiroyuki Nakahara, Koji Tsuda

We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.

ML · Feb 28, 2017
Finding Statistically Significant Interactions between Continuous Features

Mahito Sugiyama, Karsten Borgwardt

The search for higher-order feature interactions that are statistically significantly associated with a class variable is of high relevance in fields such as Genetics or Healthcare, but the combinatorial explosion of the candidate space makes this problem extremely challenging in terms of computational efficiency and proper correction for multiple testing. While recent progress has been made regarding this challenge for binary features, we here present the first solution for continuous features. We propose an algorithm which overcomes the combinatorial explosion of the search space of higher-order interactions by deriving a lower bound on the p-value for each interaction, which enables us to massively prune interactions that can never reach significance and to thereby gain more statistical power. In our experiments, our approach efficiently detects all significant interactions in a variety of synthetic and real-world datasets.

ME · Feb 27, 2017
Tensor Balancing on Statistical Manifold

Mahito Sugiyama, Hiroyuki Nakahara, Koji Tsuda

We solve tensor balancing, that is, rescaling an Nth order nonnegative tensor by multiplying N tensors of order N - 1 so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton's method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton's method, can be analytically obtained using the Moebius inversion formula, an essential tool in combinatorial mathematics. Our model is not limited to tensor balancing, but has wide applicability as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.
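For readers unfamiliar with the matrix case that this work generalizes, the following sketch balances a strictly positive square matrix so that every row and column sums to one. It uses the classical Sinkhorn-Knopp iteration, which is a first-order baseline and not the Newton-based algorithm the abstract proposes. The function name and its parameters are assumptions for this illustration.

```python
import numpy as np


def sinkhorn_knopp(matrix, num_iters=1000, tol=1e-10):
    """Balance a strictly positive square matrix by alternately
    rescaling rows and columns (classical Sinkhorn-Knopp iteration).

    Returns the balanced matrix together with the row and column
    scaling vectors r and c, so that balanced = diag(r) @ A @ diag(c).
    """
    a = np.asarray(matrix, dtype=float)
    if a.ndim != 2 or a.shape[0] != a.shape[1]:
        raise ValueError("expected a square matrix")
    if np.any(a <= 0):
        raise ValueError("this sketch assumes strictly positive entries")
    n = a.shape[0]
    r, c = np.ones(n), np.ones(n)
    balanced = a.copy()
    for _ in range(num_iters):
        r = 1.0 / (a @ c)      # make every row sum to one
        c = 1.0 / (a.T @ r)    # make every column sum to one
        balanced = r[:, None] * a * c[None, :]
        # Columns are exact after the update above; check the rows.
        if np.abs(balanced.sum(axis=1) - 1.0).max() < tol:
            break
    return balanced, r, c


rng = np.random.default_rng(0)
positive_matrix = rng.random((4, 4)) + 0.1
balanced_matrix, row_scale, col_scale = sinkhorn_knopp(positive_matrix)
```

This iteration converges linearly on positive matrices, which is the kind of behavior the abstract's quadratically convergent Newton method is meant to improve on.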

ML · Feb 15, 2016
Selective Inference Approach for Statistically Sound Predictive Pattern Mining

Shinya Suzumura, Kazuya Nakagawa, Mahito Sugiyama et al.

Discovering statistically significant patterns from databases is an important and challenging problem. The main obstacle is the difficulty of taking into account the selection bias, i.e., the bias arising from the fact that patterns are selected from an extremely large number of candidates in databases. In this paper, we introduce a new approach for predictive pattern mining problems that can address the selection bias issue. Our approach is built on a recently popularized statistical inference framework called selective inference. In selective inference, statistical inferences (such as statistical hypothesis testing) are conducted based on sampling distributions conditional on a selection event. If the selection event is characterized in a tractable way, statistical inferences can be made without concern for the selection bias issue. However, in pattern mining problems, it is difficult to characterize the entire selection process of mining algorithms. Our main contribution in this paper is to solve this challenging problem for a class of predictive pattern mining problems by introducing a novel algorithmic framework. We demonstrate that our approach is useful for finding statistically significant patterns from databases.

ML · Feb 15, 2015
Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing

Felipe Llinares López, Mahito Sugiyama, Laetitia Papaxanthos et al.

We present a novel algorithm, Westfall-Young light, for detecting patterns, such as itemsets and subgraphs, which are statistically significantly enriched in one of two classes. Our method corrects rigorously for multiple hypothesis testing and correlations between patterns through the Westfall-Young permutation procedure, which empirically estimates the null distribution of pattern frequencies in each class via permutations. In our experiments, Westfall-Young light dramatically outperforms the current state-of-the-art approach in terms of both runtime and memory efficiency on popular real-world benchmark datasets for pattern mining. The key to this efficiency is that unlike all existing methods, our algorithm neither needs to solve the underlying frequent itemset mining problem anew for each permutation nor needs to store the occurrence list of all frequent patterns. Westfall-Young light opens the door to significant pattern mining on large datasets that previously led to prohibitive runtime or memory costs.
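The Westfall-Young procedure mentioned above estimates a corrected significance threshold from the distribution of the minimum p-value under label permutations. The sketch below is the naive version of that idea, recomputing every pattern's p-value for each permutation, which is the cost that Westfall-Young light is designed to avoid. The one-sided Fisher test, the quantile rule (which ignores ties), the function names, and the toy data are all assumptions made for this illustration.

```python
import math
import random


def fisher_upper_tail(k, n, n1, x):
    """P(X >= k) when X counts positives among x occurrences drawn
    from n samples containing n1 positives (hypergeometric tail)."""
    tail = sum(
        math.comb(n1, i) * math.comb(n - n1, x - i)
        for i in range(k, min(n1, x) + 1)
    )
    return tail / math.comb(n, x)


def pattern_p_values(patterns, labels):
    """One-sided enrichment p-value of each pattern in class 1."""
    n, n1 = len(labels), sum(labels)
    p_values = []
    for occurrences in patterns:
        x = sum(occurrences)
        k = sum(o for o, y in zip(occurrences, labels) if y == 1)
        p_values.append(fisher_upper_tail(k, n, n1, x))
    return p_values


def westfall_young_threshold(patterns, labels, alpha=0.05,
                             num_perms=200, seed=0):
    """Empirical alpha-quantile of the minimum p-value over permutations."""
    rng = random.Random(seed)
    shuffled = list(labels)
    min_ps = []
    for _ in range(num_perms):
        rng.shuffle(shuffled)
        min_ps.append(min(pattern_p_values(patterns, shuffled)))
    min_ps.sort()
    return min_ps[max(0, math.floor(alpha * num_perms) - 1)]


# Toy data: pattern 0 occurs exactly in the 20 positive samples,
# the other four patterns are random noise.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 20
patterns = [list(labels)] + [
    [data_rng.randint(0, 1) for _ in labels] for _ in range(4)
]
observed = pattern_p_values(patterns, labels)
threshold = westfall_young_threshold(patterns, labels)
significant = [j for j, p in enumerate(observed) if p <= threshold]
```

Because the threshold comes from the joint permutation distribution, it accounts for correlations between patterns, unlike a Bonferroni correction that treats every test as independent.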

ML · Jul 4, 2014
Identifying Higher-order Combinations of Binary Features

Felipe Llinares, Mahito Sugiyama, Karsten M. Borgwardt

Finding statistically significant interactions between binary variables is computationally and statistically challenging in high-dimensional settings, due to the combinatorial explosion in the number of hypotheses. Terada et al. recently showed how to elegantly address this multiple testing problem by excluding non-testable hypotheses. Still, it remains unclear how their approach scales to large datasets. Here we propose strategies to speed up the approach by Terada et al. and evaluate them thoroughly on 11 real-world benchmark datasets. We observe that one approach, incremental search with early stopping, is orders of magnitude faster than the current state-of-the-art approach.

ME · Jul 1, 2014
Significant Subgraph Mining with Multiple Testing Correction

Mahito Sugiyama, Felipe Llinares López, Niklas Kasenburg et al.

The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. It was shown to lead to greater statistical power, that is, the discovery of more truly significant itemsets, than the standard Bonferroni correction on real-world datasets. An open question, however, is whether this strategy of excluding untestable hypotheses also leads to greater statistical power in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. Here we answer this question by an empirical investigation on eight popular graph benchmark datasets. We propose a new efficient search strategy, which always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests and thereby further increase the statistical power.