LG · AI · ML · Jun 29, 2020

Handling Missing Data in Decision Trees: A Probabilistic Approach

arXiv:2006.16341v1 · 37 citations
Originality: Incremental advance
AI Analysis

The paper addresses a common problem for practitioners who use decision trees, but the contribution appears incremental because it builds on existing methods for handling missing data.

The paper tackles missing data in decision trees with a probabilistic approach. At deployment time, it uses tractable density estimators to compute expected predictions; at learning time, it fine-tunes the parameters of already learned trees to minimize expected prediction loss. Brief experiments indicate the approach is effective compared with a few baselines.

Decision trees are a popular family of models due to their attractive properties such as interpretability and ability to handle heterogeneous data. Concurrently, missing data is a prevalent occurrence that hinders the performance of machine learning models. As such, handling missing data in decision trees is a well-studied problem. In this paper, we tackle this problem by taking a probabilistic approach. At deployment time, we use tractable density estimators to compute the "expected prediction" of our models. At learning time, we fine-tune parameters of already learned trees by minimizing their "expected prediction loss" w.r.t. our density estimators. We provide brief experiments showcasing the effectiveness of our methods compared to a few baselines.
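
To make the "expected prediction" idea concrete, here is a minimal sketch rather than the paper's implementation. It assumes binary features and a fully factorized distribution (independent Bernoulli marginals) as a stand-in for the tractable density estimators mentioned in the abstract. The names Leaf, Split, expected_prediction, and p_true are hypothetical and chosen only for illustration. Observed features follow their branch; a missing feature averages both branches, weighted by its probability.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    value: float  # prediction stored at this leaf


@dataclass
class Split:
    feature: int        # index of the binary feature tested at this node
    if_false: "Node"    # subtree taken when the feature is 0 / False
    if_true: "Node"     # subtree taken when the feature is 1 / True


Node = Union[Leaf, Split]


def expected_prediction(
    node: Node,
    x: Dict[int, Optional[bool]],
    p_true: Dict[int, float],
) -> float:
    """Expected tree output when some features are missing (None).

    Assumption (not from the paper): features are independent, and
    p_true[i] = P(X_i = 1). Under independence, conditioning on the
    observed features leaves the marginals of the missing ones unchanged.
    """
    if isinstance(node, Leaf):
        return node.value

    value = x.get(node.feature)
    if value is not None:
        # Observed feature: follow the usual deterministic branch.
        child = node.if_true if value else node.if_false
        return expected_prediction(child, x, p_true)

    # Missing feature: average over both outcomes. Fixing the feature in the
    # recursive call keeps the result correct if it is tested again deeper.
    p = p_true[node.feature]
    return (
        p * expected_prediction(node.if_true, {**x, node.feature: True}, p_true)
        + (1.0 - p)
        * expected_prediction(node.if_false, {**x, node.feature: False}, p_true)
    )


# Toy tree: predicts 0.0 if feature 0 is false, otherwise splits on feature 1.
tree = Split(0, Leaf(0.0), Split(1, Leaf(0.2), Leaf(1.0)))
marginals = {0: 0.5, 1: 0.25}

partially_observed = expected_prediction(tree, {0: True, 1: None}, marginals)
fully_missing = expected_prediction(tree, {0: None, 1: None}, marginals)
```

With feature 0 observed as true and feature 1 missing, the result is 0.25 * 1.0 + 0.75 * 0.2 = 0.4; with both missing it is 0.5 * 0.4 = 0.2. A richer density estimator would replace the fixed marginals with conditionals given the observed features, and the learning-time step described in the abstract would adjust leaf values to minimize a loss on these expected outputs.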

