Mathieu Guillame-Bert

LG
h-index8
8papers
48citations
Novelty53%
AI Score27

8 Papers

LGAug 7, 2023
Generative Forests

Richard Nock, Mathieu Guillame-Bert

We focus on generative AI for a type of data that still represent one of the most prevalent form of data: tabular data. Our paper introduces two key contributions: a new powerful class of forest-based models fit for such tasks and a simple training algorithm with strong convergence guarantees in a boosting model that parallels that of the original weak / strong supervised learning setting. This algorithm can be implemented by a few tweaks to the most popular induction scheme for decision tree induction (i.e. supervised learning) with two classes. Experiments on the quality of generated data display substantial improvements compared to the state of the art. The losses our algorithm minimize and the structure of our models make them practical for related tasks that require fast estimation of a density given a generative model and an observation (even partially specified): such tasks include missing data imputation and density estimation. Additional experiments on these tasks reveal that our models can be notably good contenders to diverse state of the art methods, relying on models as diverse as (or mixing elements of) trees, neural nets, kernels or graphical models.

LGDec 6, 2022
Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library

Mathieu Guillame-Bert, Sebastian Bruch, Richard Stotz et al.

Yggdrasil Decision Forests is a library for the training, serving and interpretation of decision forest models, targeted both at research and production work, implemented in C++, and available in C++, command line interface, Python (under the name TensorFlow Decision Forests), JavaScript, Go, and Google Sheets (under the name Simple ML for Sheets). The library has been developed organically since 2018 following a set of four design principles applicable to machine learning libraries and frameworks: simplicity of use, safety of use, modularity and high-level abstraction, and integration with other machine learning libraries. In this paper, we describe those principles in detail and present how they have been used to guide the design of the library. We then showcase the use of our library on a set of classical machine learning problems. Finally, we report a benchmark comparing our library to related solutions.

LGFeb 22, 2024
Boosting gets full Attention for Relational Learning

Mathieu Guillame-Bert, Richard Nock

More often than not in benchmark supervised ML, tabular data is flat, i.e. consists of a single $m \times d$ (rows, columns) file, but cases abound in the real world where observations are described by a set of tables with structural relationships. Neural nets-based deep models are a classical fit to incorporate general topological dependence among description features (pixels, words, etc.), but their suboptimality to tree-based models on tabular data is still well documented. In this paper, we introduce an attention mechanism for structured data that blends well with tree-based models in the training context of (gradient) boosting. Each aggregated model is a tree whose training involves two steps: first, simple tabular models are learned descending tables in a top-down fashion with boosting's class residuals on tables' features. Second, what has been learned progresses back bottom-up via attention and aggregation mechanisms, progressively crafting new features that complete at the end the set of observation features over which a single tree is learned, boosting's iteration clock is incremented and new class residuals are computed. Experiments on simulated and real-world domains display the competitiveness of our method against a state of the art containing both tree-based and neural nets-based models.

LGJan 26, 2022
Generative Trees: Adversarial and Copycat

Richard Nock, Mathieu Guillame-Bert

While Generative Adversarial Networks (GANs) achieve spectacular results on unstructured data like images, there is still a gap on tabular data, data for which state of the art supervised learning still favours to a large extent decision tree (DT)-based models. This paper proposes a new path forward for the generation of tabular data, exploiting decades-old understanding of the supervised task's best components for DT induction, from losses (properness), models (tree-based) to algorithms (boosting). The \textit{properness} condition on the supervised loss -- which postulates the optimality of Bayes rule -- leads us to a variational GAN-style loss formulation which is \textit{tight} when discriminators meet a calibration property trivially satisfied by DTs, and, under common assumptions about the supervised loss, yields "one loss to train against them all" for the generator: the $χ^2$. We then introduce tree-based generative models, \textit{generative trees} (GTs), meant to mirror on the generative side the good properties of DTs for classifying tabular data, with a boosting-compliant \textit{adversarial} training algorithm for GTs. We also introduce \textit{copycat training}, in which the generator copies at run time the underlying tree (graph) of the discriminator DT and completes it for the hardest discriminative task, with boosting compliant convergence. We test our algorithms on tasks including fake/real distinction, training from fake data and missing data imputation. Each one of these tasks displays that GTs can provide comparatively simple -- and interpretable -- contenders to sophisticated state of the art methods for data generation (using neural network models) or missing data imputation (relying on multiple imputation by chained equations with complex tree-based modeling).

LGSep 21, 2020
Modeling Text with Decision Forests using Categorical-Set Splits

Mathieu Guillame-Bert, Sebastian Bruch, Petr Mitrichev et al.

Decision forest algorithms typically model data by learning a binary tree structure recursively where every node splits the feature space into two sub-regions, sending examples into the left or right branch as a result. In axis-aligned decision forests, the "decision" to route an input example is the result of the evaluation of a condition on a single dimension in the feature space. Such conditions are learned using efficient, often greedy algorithms that optimize a local loss function. For example, a node's condition may be a threshold function applied to a numerical feature, and its parameter may be learned by sweeping over the set of values available at that node and choosing a threshold that maximizes some measure of purity. Crucially, whether an algorithm exists to learn and evaluate conditions for a feature type determines whether a decision forest algorithm can model that feature type at all. For example, decision forests today cannot consume textual features directly -- such features must be transformed to summary statistics instead. In this work, we set out to bridge that gap. We define a condition that is specific to categorical-set features -- defined as an unordered set of categorical variables -- and present an algorithm to learn it, thereby equipping decision forests with the ability to directly model text, albeit without preserving sequential order. Our algorithm is efficient during training and the resulting conditions are fast to evaluate with our extension of the QuickScorer inference algorithm. Experiments on benchmark text classification datasets demonstrate the utility and effectiveness of our proposal.

LGJul 29, 2020
Learning Representations for Axis-Aligned Decision Forests through Input Perturbation

Sebastian Bruch, Jan Pfeifer, Mathieu Guillame-bert

Axis-aligned decision forests have long been the leading class of machine learning algorithms for modeling tabular data. In many applications of machine learning such as learning-to-rank, decision forests deliver remarkable performance. They also possess other coveted characteristics such as interpretability. Despite their widespread use and rich history, decision forests to date fail to consume raw structured data such as text, or learn effective representations for them, a factor behind the success of deep neural networks in recent years. While there exist methods that construct smoothed decision forests to achieve representation learning, the resulting models are decision forests in name only: They are no longer axis-aligned, use stochastic decisions, or are not interpretable. Furthermore, none of the existing methods are appropriate for problems that require a Transfer Learning treatment. In this work, we present a novel but intuitive proposal to achieve representation learning for decision forests without imposing new restrictions or necessitating structural changes. Our model is simply a decision forest, possibly trained using any forest learning algorithm, atop a deep neural network. By approximating the gradients of the decision forest through input perturbation, a purely analytical procedure, the decision forest directs the neural network to learn or fine-tune representations. Our framework has the advantage that it is applicable to any arbitrary decision forest and that it allows the use of arbitrary deep neural networks for representation learning. We demonstrate the feasibility and effectiveness of our proposal through experiments on synthetic and benchmark classification datasets.

LGApr 18, 2018
Exact Distributed Training: Random Forest with Billions of Examples

Mathieu Guillame-Bert, Olivier Teytaud

We introduce an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search. We explain the proposed algorithm and compare it to related approaches for various complexity measures (time, ram, disk, and network complexity analysis). We report its running performances on artificial and real-world datasets of up to 18 billions examples. This figure is several orders of magnitude larger than datasets tackled in the existing literature. Finally, we empirically show that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets. Given a dataset with 17.3B examples with 82 features (3 numerical, other categorical with high arity), our implementation trains a tree in 22h.

LGMar 8, 2016
Batched Lazy Decision Trees

Mathieu Guillame-Bert, Artur Dubrawski

We introduce a batched lazy algorithm for supervised classification using decision trees. It avoids unnecessary visits to irrelevant nodes when it is used to make predictions with either eagerly or lazily trained decision trees. A set of experiments demonstrate that the proposed algorithm can outperform both the conventional and lazy decision tree algorithms in terms of computation time as well as memory consumption, without compromising accuracy.