ML, CL, LG · Jul 18, 2018

Efficient Training on Very Large Corpora via Gramian Estimation

arXiv:1807.07187v1 · 50 citations
Originality: Incremental advance
AI Analysis

This work addresses scalability issues for researchers and practitioners who apply similarity learning to large datasets. It is rated as incremental because it builds on matrix factorization techniques.

The paper tackles the cost of training neural network embedding models on very large corpora, where the number of sampled pairs grows quadratically with corpus size. It proposes methods based on Gramian estimation that avoid sampling unobserved pairs, and reports significant improvements in training time and generalization quality in large-scale experiments.

We study the problem of learning similarity functions over very large corpora using neural network embedding models. These models are typically trained using SGD with sampling of random observed and unobserved pairs, with a number of samples that grows quadratically with the corpus size, making it expensive to scale to very large corpora. We propose new efficient methods to train these models without having to sample unobserved pairs. Inspired by matrix factorization, our approach relies on adding a global quadratic penalty to all pairs of examples and expressing this term as the matrix-inner-product of two generalized Gramians. We show that the gradient of this term can be efficiently computed by maintaining estimates of the Gramians, and develop variance reduction schemes to improve the quality of the estimates. We conduct large-scale experiments that show a significant improvement in training time and generalization quality compared to traditional sampling methods.
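The central identity in the abstract can be checked on a toy example. The sketch below is dependency-free and treats the embeddings as fixed vectors rather than neural network outputs. It shows that the quadratic penalty summed over all pairs equals the matrix inner product of the two Gramians, and that the gradient for one embedding needs only the other side's Gramian. All function names are hypothetical, the penalty is left unnormalized, and the moving-average update is an illustrative assumption, not necessarily the estimation scheme the paper uses.

```python
import random


def gramian(rows):
    """Return the d x d Gramian sum_i r_i r_i^T of a list of d-dimensional rows."""
    d = len(rows[0])
    g = [[0.0] * d for _ in range(d)]
    for r in rows:
        for a in range(d):
            for b in range(d):
                g[a][b] += r[a] * r[b]
    return g


def frobenius_inner(a, b):
    """Matrix inner product <A, B>: sum of elementwise products."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def penalty_all_pairs(U, V):
    """Naive penalty: sum over all len(U) * len(V) pairs of <u_i, v_j>^2."""
    return sum(sum(x * y for x, y in zip(u, v)) ** 2 for u in U for v in V)


def penalty_gramian(U, V):
    """Same penalty computed as <G_u, G_v>, with no loop over pairs."""
    return frobenius_inner(gramian(U), gramian(V))


def penalty_grad_row(u, gram_v):
    """Gradient of the penalty with respect to one left embedding u: 2 * G_v u."""
    return [2.0 * sum(g * x for g, x in zip(row, u)) for row in gram_v]


def update_gramian_estimate(estimate, batch, corpus_size, rate):
    """Assumed scheme: move the estimate toward a rescaled batch Gramian."""
    scale = corpus_size / len(batch)
    batch_gram = gramian(batch)
    return [
        [(1.0 - rate) * e + rate * scale * b for e, b in zip(row_e, row_b)]
        for row_e, row_b in zip(estimate, batch_gram)
    ]


if __name__ == "__main__":
    random.seed(0)
    dim = 3
    U = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
    V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(7)]
    print(penalty_all_pairs(U, V), penalty_gramian(U, V))
```

The naive sum touches every pair, so its cost grows with the product of the two corpus sizes. The Gramian form costs one pass over each side plus a d x d inner product, which is why maintaining estimates of the two Gramians is enough to compute the penalty gradient.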
