LG · Jun 26, 2023

Tanimoto Random Features for Scalable Molecular Machine Learning

Cambridge
arXiv:2306.14809v2 · 15 citations · h-index: 55
Originality: Incremental advance
AI Analysis

This work addresses scalability in molecular machine learning and is relevant to both researchers and practitioners. It is an incremental advance, since it builds on existing kernel methods.

The paper addresses the lack of scalable approximations for the Tanimoto kernel, which is used to measure molecular similarity. It proposes two novel kinds of random features that let the kernel scale to large datasets and, in the process, extends the kernel to real-valued vectors. The features are validated experimentally on real-world molecular tasks.

The Tanimoto coefficient is commonly used to measure the similarity between molecules represented as discrete fingerprints, either as a distance metric or a positive definite kernel. While many kernel methods can be accelerated using random feature approximations, at present there is a lack of such approximations for the Tanimoto kernel. In this paper we propose two kinds of novel random features to allow this kernel to scale to large datasets, and in the process discover a novel extension of the kernel to real-valued vectors. We theoretically characterize these random features, and provide error bounds on the spectral norm of the Gram matrix. Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks.
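To make the scaling problem concrete, here is a minimal sketch of the exact Tanimoto kernel on binary fingerprints, written in its standard dot-product form T(x, y) = <x, y> / (||x||^2 + ||y||^2 - <x, y>). This is not the paper's random-feature construction. The function name `tanimoto_gram`, the toy fingerprints, and the convention that two all-zero fingerprints have similarity 1 are assumptions made for illustration.

```python
import numpy as np

def tanimoto_gram(X, Y):
    """Exact Tanimoto Gram matrix between the rows of X and the rows of Y.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    dot = X @ Y.T                              # pairwise inner products
    x_sq = np.sum(X * X, axis=1)[:, None]      # squared norms of rows of X
    y_sq = np.sum(Y * Y, axis=1)[None, :]      # squared norms of rows of Y
    denom = x_sq + y_sq - dot
    # Assumed convention: when the denominator is zero (both vectors are
    # all-zero), the similarity is set to 1.
    out = np.ones_like(dot)
    np.divide(dot, denom, out=out, where=denom > 0)
    return out

# Toy binary fingerprints (made-up data, one molecule per row).
fps = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
])
K = tanimoto_gram(fps, fps)
print(K)
```

For n molecules the Gram matrix has n^2 entries, and exact kernel methods built on it typically need time and memory that grow at least quadratically in n. That cost is what random feature approximations of the kind the abstract describes aim to avoid.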

Code Implementations: 1 repo
