LGDSPFApr 19

Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels

arXiv:2601.0273526.11 citationsh-index: 5
AI Analysis

For practitioners using random forests for kernel methods or representation learning, this work removes a key scalability bottleneck by providing an efficient exact computation of proximity matrices.

The paper introduces a unified view of forest proximities as sparse leaf-incidence kernels, enabling exact finite-sample sparse factorization that reduces computation from quadratic to near-linear time and memory. Benchmarks confirm near-linear scaling across datasets and forest settings.

Decision forests induce supervised similarities through the partition structure of their trees. Yet forest proximity computation is still often treated as a quadratic operation in the number of samples, which limits scalability and restricts broader use in kernel and representation-learning pipelines. We introduce a unified view of leaf-collision forest proximities through a class of Separable Weighted Leaf-Collision (SWLC) kernels, showing that most existing proximities differ only in their weighting scheme while sharing a common sparse leaf-incidence structure. This yields an explicit leaf-space representation that clarifies their kernel interpretation and leads to an exact finite-sample sparse factorization of the proximity matrix, avoiding an explicit all-pairs comparison and reducing computation to sparse linear algebra over leaf collisions. We implement this framework in a memory-efficient Python library and show, both theoretically and empirically, that exact kernel computation scales near-linearly in time and memory under standard forest regimes. Benchmarks verify the predicted scaling behavior in practice across datasets, proximity definitions, and forest settings, and show that the resulting sparse leaf-space representation can also be used directly for fast task-aware embedding.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes