LGSep 19, 2022
Toward Understanding Privileged Features Distillation in Learning-to-RankShuo Yang, Sujay Sanghavi, Holakou Rahmanian et al.
In learning-to-rank problems, a privileged feature is one that is available during model training, but not available at test time. Such features naturally arise in merchandised recommendation systems; for instance, "user clicked this item" as a feature is predictive of "user purchased this item" in the offline data, but is clearly not available during online serving. Another source of privileged features is those that are too expensive to compute online but feasible to be added offline. Privileged features distillation (PFD) refers to a natural idea: train a "teacher" model using all features (including privileged ones) and then use it to train a "student" model that does not use the privileged features. In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon's logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models. Both investigations uncover an interesting non-monotone behavior: as the predictive power of a privileged feature increases, the performance of the resulting student model initially increases but then decreases. We show the reason for the later decreasing performance is that a very predictive privileged teacher produces predictions with high variance, which lead to high variance student estimates and inferior testing performance.
IRDec 5, 2023Code
PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval ModelsWei-Cheng Chang, Jyun-Yu Jiang, Jiong Zhang et al.
Embedding-based Retrieval Models (ERMs) have emerged as a promising framework for large-scale text retrieval problems due to powerful large language models. Nevertheless, fine-tuning ERMs to reach state-of-the-art results can be expensive due to the extreme scale of data as well as the complexity of multi-stages pipelines (e.g., pre-training, fine-tuning, distillation). In this work, we propose the PEFA framework, namely ParamEter-Free Adapters, for fast tuning of ERMs without any backward pass in the optimization. At index building stage, PEFA equips the ERM with a non-parametric k-nearest neighbor (kNN) component. At inference stage, PEFA performs a convex combination of two scoring functions, one from the ERM and the other from the kNN. Based on the neighborhood definition, PEFA framework induces two realizations, namely PEFA-XL (i.e., extra large) using double ANN indices and PEFA-XS (i.e., extra small) using a single ANN index. Empirically, PEFA achieves significant improvement on two retrieval applications. For document retrieval, regarding Recall@100 metric, PEFA improves not only pre-trained ERMs on Trivia-QA by an average of 13.2%, but also fine-tuned ERMs on NQ-320K by an average of 5.5%, respectively. For product search, PEFA improves the Recall@100 of the fine-tuned ERMs by an average of 5.3% and 14.5%, for PEFA-XS and PEFA-XL, respectively. Our code is available at https://github.com/amzn/pecos/tree/mainline/examples/pefa-wsdm24.
CLFeb 15, 2025
Retrieval-augmented Encoders for Extreme Multi-label Text ClassificationYau-Shian Wang, Wei-Cheng Chang, Jyun-Yu Jiang et al.
Extreme multi-label classification (XMC) seeks to find relevant labels from an extremely large label collection for a given text input. To tackle such a vast label space, current state-of-the-art methods fall into two categories. The one-versus-all (OVA) method uses learnable label embeddings for each label, excelling at memorization (i.e., capturing detailed training signals for accurate head label prediction). In contrast, the dual-encoder (DE) model maps input and label text into a shared embedding space for better generalization (i.e., the capability of predicting tail labels with limited training data), but may fall short at memorization. To achieve generalization and memorization, existing XMC methods often combine DE and OVA models, which involves complex training pipelines. Inspired by the success of retrieval-augmented language models, we propose the Retrieval-augmented Encoders for XMC (RAEXMC), a novel framework that equips a DE model with retrieval-augmented capability for efficient memorization without additional trainable parameter. During training, RAEXMC is optimized by the contrastive loss over a knowledge memory that consists of both input instances and labels. During inference, given a test input, RAEXMC retrieves the top-$K$ keys from the knowledge memory, and aggregates the corresponding values as the prediction scores. We showcase the effectiveness and efficiency of RAEXMC on four public LF-XMC benchmarks. RAEXMC not only advances the state-of-the-art (SOTA) DE method DEXML, but also achieves more than 10x speedup on the largest LF-AmazonTitles-1.3M dataset under the same 8 A100 GPUs training environments.
LGApr 29, 2020
DS-FACTO: Doubly Separable Factorization MachinesParameswaran Raman, S. V. N. Vishwanathan
Factorization Machines (FM) are powerful class of models that incorporate higher-order interaction among features to add more expressive power to linear models. They have been used successfully in several real-world tasks such as click-prediction, ranking and recommender systems. Despite using a low-rank representation for the pairwise features, the memory overheads of using factorization machines on large-scale real-world datasets can be prohibitively high. For instance on the criteo tera dataset, assuming a modest $128$ dimensional latent representation and $10^{9}$ features, the memory requirement for the model is in the order of $1$ TB. In addition, the data itself occupies $2.1$ TB. Traditional algorithms for FM which work on a single-machine are not equipped to handle this scale and therefore, using a distributed algorithm to parallelize the computation across a cluster is inevitable. In this work, we propose a hybrid-parallel stochastic optimization algorithm DS-FACTO, which partitions both the data as well as parameters of the factorization machine simultaneously. Our solution is fully de-centralized and does not require the use of any parameter servers. We present empirical results to analyze the convergence behavior, predictive power and scalability of DS-FACTO.
IRAug 29, 2019
A Zero Attention Model for Personalized Product SearchQingyao Ai, Daniel N. Hill, S. V. N. Vishwanathan et al.
Product search is one of the most popular methods for people to discover and purchase products on e-commerce websites. Because personal preferences often have an important influence on the purchase decision of each customer, it is intuitive that personalization should be beneficial for product search engines. While synthetic experiments from previous studies show that purchase histories are useful for identifying the individual intent of each product search session, the effect of personalization on product search in practice, however, remains mostly unknown. In this paper, we formulate the problem of personalized product search and conduct large-scale experiments with search logs sampled from a commercial e-commerce search engine. Results from our preliminary analysis show that the potential of personalization depends on query characteristics, interactions between queries, and user purchase histories. Based on these observations, we propose a Zero Attention Model for product search that automatically determines when and how to personalize a user-query pair via a novel attention mechanism. Empirical results on commercial product search logs show that the proposed model not only significantly outperforms state-of-the-art personalized product retrieval models, but also provides important information on the potential of personalization in each product search session.
LGJun 2, 2017
Online Dynamic ProgrammingHolakou Rahmanian, Manfred K. Warmuth, S. V. N. Vishwanathan
We propose a general method for combinatorial online learning problems whose offline optimization problem can be solved efficiently via a dynamic programming algorithm defined by an arbitrary min-sum recurrence. Examples include online learning of Binary Search Trees, Matrix-Chain Multiplications, $k$-sets, Knapsacks, Rod Cuttings, and Weighted Interval Schedulings. For each of these problems we use the underlying graph of subproblems (called a multi-DAG) for defining a representation of the solutions of the dynamic programming problem by encoding them as a generalized version of paths (called multipaths). These multipaths encode each solution as a series of successive decisions or components over which the loss is linear. We then show that the dynamic programming algorithm for each problem leads to online algorithms for learning multipaths in the underlying multi-DAG. The algorithms maintain a distribution over the multipaths in a concise form as their hypothesis. More specifically we generalize the existing Expanded Hedge and Component Hedge algorithms for the online shortest path problem to learning multipaths. Additionally, we introduce a new and faster prediction technique for Component Hedge which in our case directly samples from a distribution over multipaths, bypassing the need to decompose the distribution over multipaths into a mixture with small support.
LGApr 22, 2017
Batch-Expansion Training: An Efficient Optimization FrameworkMichał Dereziński, Dhruv Mahajan, S. Sathiya Keerthi et al.
We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset. As opposed to stochastic approaches, batches do not need to be resampled i.i.d. at every iteration, thus making BET more resource efficient in a distributed setting, and when disk-access is constrained. Moreover, BET can be easily paired with most batch optimizers, does not require any parameter-tuning, and compares favorably to existing stochastic and batch methods. We show that when the batch size grows exponentially with the number of outer iterations, BET achieves optimal $O(1/ε)$ data-access convergence rate for strongly convex objectives. Experiments in parallel and distributed settings show that BET performs better than standard batch and stochastic approaches.
LGSep 17, 2016
Online Learning of Combinatorial Objects via Extended FormulationHolakou Rahmanian, David P. Helmbold, S. V. N. Vishwanathan
The standard techniques for online learning of combinatorial objects perform multiplicative updates followed by projections into the convex hull of all the objects. However, this methodology can be expensive if the convex hull contains many facets. For example, the convex hull of $n$-symbol Huffman trees is known to have exponentially many facets (Maurras et al., 2010). We get around this difficulty by exploiting extended formulations (Kaibel, 2011), which encode the polytope of combinatorial objects in a higher dimensional "extended" space with only polynomially many facets. We develop a general framework for converting extended formulations into efficient online algorithms with good relative loss bounds. We present applications of our framework to online learning of Huffman trees and permutations. The regret bounds of the resulting algorithms are within a factor of $O(\sqrt{\log(n)})$ of the state-of-the-art specialized algorithms for permutations, and depending on the loss regimes, improve on or match the state-of-the-art for Huffman trees. Our method is general and can be applied to other combinatorial objects.
MLMay 31, 2016
Extreme Stochastic Variational Inference: Distributed and AsynchronousJiong Zhang, Parameswaran Raman, Shihao Ji et al.
Stochastic variational inference (SVI), the state-of-the-art algorithm for scaling variational inference to large-datasets, is inherently serial. Moreover, it requires the parameters to fit in the memory of a single processor; this is problematic when the number of parameters is in billions. In this paper, we propose extreme stochastic variational inference (ESVI), an asynchronous and lock-free algorithm to perform variational inference for mixture models on massive real world datasets. ESVI overcomes the limitations of SVI by requiring that each processor only access a subset of the data and a subset of the parameters, thus providing data and model parallelism simultaneously. We demonstrate the effectiveness of ESVI by running Latent Dirichlet Allocation (LDA) on UMBC-3B, a dataset that has a vocabulary of 3 million and a token size of 3 billion. In our experiments, we found that ESVI not only outperforms VI and SVI in wallclock-time, but also achieves a better quality solution. In addition, we propose a strategy to speed up computation and save memory when fitting large number of topics.
LGApr 16, 2016
DS-MLR: Exploiting Double Separability for Scaling up Distributed Multinomial Logistic RegressionParameswaran Raman, Sriram Srinivasan, Shin Matsushima et al.
Scaling multinomial logistic regression to datasets with very large number of data points and classes is challenging. This is primarily because one needs to compute the log-partition function on every data point. This makes distributing the computation hard. In this paper, we present a distributed stochastic gradient descent based optimization method (DS-MLR) for scaling up multinomial logistic regression problems to massive scale datasets without hitting any storage constraints on the data and model parameters. Our algorithm exploits double-separability, an attractive property that allows us to achieve both data as well as model parallelism simultaneously. In addition, we introduce a non-blocking and asynchronous variant of our algorithm that avoids bulk-synchronization. We demonstrate the versatility of DS-MLR to various scenarios in data and model parallelism, through an extensive empirical study using several real-world datasets. In particular, we demonstrate the scalability of DS-MLR by solving an extreme multi-class classification problem on the Reddit dataset (159 GB data, 358 GB parameters) where, to the best of our knowledge, no other existing methods apply.
LGNov 21, 2015
BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large VocabulariesShihao Ji, S. V. N. Vishwanathan, Nadathur Satish et al.
We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million word vocabularies. BlackOut is motivated by using a discriminative loss, and we describe a new sampling strategy which significantly reduces computation while improving stability, sample efficiency, and rate of convergence. One way to understand BlackOut is to view it as an extension of the DropOut strategy to the output layer, wherein we use a discriminative training loss and a weighted sampling scheme. We also establish close connections between BlackOut, importance sampling, and noise contrastive estimation (NCE). Our experiments, on the recently released one billion word language modeling benchmark, demonstrate scalability and accuracy of BlackOut; we outperform the state-of-the art, and achieve the lowest perplexity scores on this dataset. Moreover, unlike other established methods which typically require GPUs or CPU clusters, we show that a carefully implemented version of BlackOut requires only 1-10 days on a single machine to train a RNNLM with a million word vocabulary and billions of parameters on one billion words. Although we describe BlackOut in the context of RNNLM training, it can be used to any networks with large softmax output layers.
CLJun 9, 2015
WordRank: Learning Word Embeddings via Robust RankingShihao Ji, Hyokun Yun, Pinar Yanardag et al.
Embedding words in a vector space has gained a lot of attention in recent years. While state-of-the-art methods provide efficient computation of word similarities via a low-dimensional matrix embedding, their motivation is often left unclear. In this paper, we argue that word embedding can be naturally viewed as a ranking problem due to the ranking nature of the evaluation metrics. Then, based on this insight, we propose a novel framework WordRank that efficiently estimates word representations via robust ranking, in which the attention mechanism and robustness to noise are readily achieved via the DCG-like ranking losses. The performance of WordRank is measured in word similarity and word analogy benchmarks, and the results are compared to the state-of-the-art word embedding techniques. Our algorithm is very competitive to the state-of-the- arts on large corpora, while outperforms them by a significant margin when the training set is limited (i.e., sparse and noisy). With 17 million tokens, WordRank performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark. Our multi-node distributed implementation of WordRank is publicly available for general usage.
HCMay 25, 2015
Colors $-$Messengers of Concepts: Visual Design Mining for Learning Color SemanticsAli Jahanian, S. V. N. Vishwanathan, Jan P. Allebach
This paper studies the concept of color semantics by modeling a dataset of magazine cover designs, evaluating the model via crowdsourcing, and demonstrating several prototypes that facilitate color-related design tasks. We investigate a probabilistic generative modeling framework that expresses semantic concepts as a combination of color and word distributions $-$color-word topics. We adopt an extension to Latent Dirichlet Allocation (LDA) topic modeling called LDA-dual to infer a set of color-word topics over a corpus of 2,654 magazine covers spanning 71 distinct titles and 12 genres. While LDA models text documents as distributions over word topics, we model magazine covers as distributions over color-word topics. The results of our crowdsourced experiments confirm that the model is able to successfully discover the associations between colors and linguistic concepts. Finally, we demonstrate several simple prototypes that apply the learned model to color palette recommendation, design example retrieval, image retrieval, image color selection, and image recoloring.
LGApr 7, 2015
Totally Corrective Boosting with Cardinality PenalizationVasil S. Denchev, Nan Ding, Shin Matsushima et al.
We propose a totally corrective boosting algorithm with explicit cardinality regularization. The resulting combinatorial optimization problems are not known to be efficiently solvable with existing classical methods, but emerging quantum optimization technology gives hope for achieving sparser models in practice. In order to demonstrate the utility of our algorithm, we use a distributed classical heuristic optimizer as a stand-in for quantum hardware. Even though this evaluation methodology incurs large time and resource costs on classical computing machinery, it allows us to gauge the potential gains in generalization performance and sparsity of the resulting boosted ensembles. Our experimental results on public data sets commonly used for benchmarking of boosting algorithms decidedly demonstrate the existence of such advantages. If actual quantum optimization were to be used with this algorithm in the future, we would expect equivalent or superior results at much smaller time and energy costs during training. Moreover, studying cardinality-penalized boosting also sheds light on why unregularized boosting algorithms with early stopping often yield better results than their counterparts with explicit convex regularization: Early stopping performs suboptimal cardinality regularization. The results that we present here indicate it is beneficial to explicitly solve the combinatorial problem still left open at early termination.
MLJun 17, 2014
DFacTo: Distributed Factorization of TensorsJoon Hee Choi, S. V. N. Vishwanathan
We present a technique for significantly speeding up Alternating Least Squares (ALS) and Gradient Descent (GD), two widely used algorithms for tensor factorization. By exploiting properties of the Khatri-Rao product, we show how to efficiently address a computationally challenging sub-step of both algorithms. Our algorithm, DFacTo, only requires two sparse matrix-vector products and is easy to parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster than competing algorithms on a variety of datasets. For instance, DFacTo only takes 480 seconds on 4 machines to perform one iteration of the ALS algorithm and 1,143 seconds to perform one iteration of the GD algorithm on a 6.5 million x 2.5 million x 1.5 million dimensional tensor with 1.2 billion non-zero entries.
MLJun 17, 2014
Distributed Stochastic Optimization of the Regularized RiskShin Matsushima, Hyokun Yun, Xinhua Zhang et al.
Many machine learning algorithms minimize a regularized risk, and stochastic optimization is widely used for this task. When working with massive data, it is desirable to perform stochastic optimization in parallel. Unfortunately, many existing stochastic optimization algorithms cannot be parallelized efficiently. In this paper we show that one can rewrite the regularized risk minimization problem as an equivalent saddle-point problem, and propose an efficient distributed stochastic optimization (DSO) algorithm. We prove the algorithm's rate of convergence; remarkably, our analysis shows that the algorithm scales almost linearly with the number of processors. We also verify with empirical evaluations that the proposed algorithm is competitive with other parallel, general purpose stochastic and batch optimization algorithms for regularized risk minimization.
LGMar 3, 2014
The Structurally Smoothed Graphlet KernelPinar Yanardag, S. V. N. Vishwanathan
A commonly used paradigm for representing graphs is to use a vector that contains normalized frequencies of occurrence of certain motifs or sub-graphs. This vector representation can be used in a variety of applications, such as, for computing similarity between graphs. The graphlet kernel of Shervashidze et al. [32] uses induced sub-graphs of k nodes (christened as graphlets by Przulj [28]) as motifs in the vector representation, and computes the kernel via a dot product between these vectors. One can easily show that this is a valid kernel between graphs. However, such a vector representation suffers from a few drawbacks. As k becomes larger we encounter the sparsity problem; most higher order graphlets will not occur in a given graph. This leads to diagonal dominance, that is, a given graph is similar to itself but not to any other graph in the dataset. On the other hand, since lower order graphlets tend to be more numerous, using lower values of k does not provide enough discrimination ability. We propose a smoothing technique to tackle the above problems. Our method is based on a novel extension of Kneser-Ney and Pitman-Yor smoothing techniques from natural language processing to graphs. We use the relationships between lower order and higher order graphlets in order to derive our method. Consequently, our smoothing algorithm not only respects the dependency between sub-graphs but also tackles the diagonal dominance problem by distributing the probability mass across graphlets. In our experiments, the smoothed graphlet kernel outperforms graph kernels based on raw frequency counts.
MLFeb 11, 2014
Ranking via Robust Binary Classification and Parallel Parameter Estimation in Large-Scale DataHyokun Yun, Parameswaran Raman, S. V. N. Vishwanathan
We propose RoBiRank, a ranking algorithm that is motivated by observing a close connection between evaluation metrics for learning to rank and loss functions for robust classification. The algorithm shows a very competitive performance on standard benchmark datasets against other representative algorithms in the literature. On the other hand, in large scale problems where explicit feature vectors and scores are not given, our algorithm can be efficiently parallelized across a large number of machines; for a task that requires 386,133 x 49,824,519 pairwise interactions between items to be ranked, our algorithm finds solutions that are of dramatically higher quality than that can be found by a state-of-the-art competitor algorithm, given the same amount of wall-clock time for computation.
IRJan 1, 2014
Modeling Attractiveness and Multiple Clicks in Sponsored Search ResultsDinesh Govindaraj, Tao Wang, S. V. N. Vishwanathan
Click models are an important tool for leveraging user feedback, and are used by commercial search engines for surfacing relevant search results. However, existing click models are lacking in two aspects. First, they do not share information across search results when computing attractiveness. Second, they assume that users interact with the search results sequentially. Based on our analysis of the click logs of a commercial search engine, we observe that the sequential scan assumption does not always hold, especially for sponsored search results. To overcome the above two limitations, we propose a new click model. Our key insight is that sharing information across search results helps in identifying important words or key-phrases which can then be used to accurately compute attractiveness of a search result. Furthermore, we argue that the click probability of a position as well as its attractiveness changes during a user session and depends on the user's past click experience. Our model seamlessly incorporates the effect of externalities (quality of other search results displayed in response to a user query), user fatigue, as well as pre and post-click relevance of a sponsored search result. We propose an efficient one-pass inference scheme and empirically evaluate the performance of our model via extensive experiments using the click logs of a large commercial search engine.
MLFeb 27, 2012
Efficiently Sampling Multiplicative Attribute Graphs Using a Ball-Dropping ProcessHyokun Yun, S. V. N. Vishwanathan
We introduce a novel and efficient sampling algorithm for the Multiplicative Attribute Graph Model (MAGM - Kim and Leskovec (2010)}). Our algorithm is \emph{strictly} more efficient than the algorithm proposed by Yun and Vishwanathan (2012), in the sense that our method extends the \emph{best} time complexity guarantee of their algorithm to a larger fraction of parameter space. Both in theory and in empirical evaluation on sparse graphs, our new algorithm outperforms the previous one. To design our algorithm, we first define a stochastic \emph{ball-dropping process} (BDP). Although a special case of this process was introduced as an efficient approximate sampling algorithm for the Kronecker Product Graph Model (KPGM - Leskovec et al. (2010)}), neither \emph{why} such an approximation works nor \emph{what} is the actual distribution this process is sampling from has been addressed so far to the best of our knowledge. Our rigorous treatment of the BDP enables us to clarify the rational behind a BDP approximation of KPGM, and design an efficient sampling algorithm for the MAGM.