MLJun 7, 2022
Learning Backward Compatible EmbeddingsWeihua Hu, Rajas Bansal, Kaidi Cao et al. · stanford
Embeddings, low-dimensional vector representation of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice. Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves the intended task performance similar to the embedding model that is solely optimized for the intended task.
CLOct 3, 2023
Dodo: Dynamic Contextual Compression for Decoder-only LMsGuanghui Qin, Corby Rosset, Ethan C. Chau et al. · microsoft-research
Transformer-based language models (LMs) are inefficient in long contexts. We propose Dodo, a solution for context compression. Instead of one vector per token in a standard transformer model, Dodo represents text with a dynamic number of hidden states at each layer, reducing the cost of self-attention to a fraction of typical time and space. Moreover, off-the-shelf models such as LLaMA can be adapted to Dodo by efficient parameter tuning methods such as LoRA. In use, Dodo can act as either an autoregressive LM or a context compressor for downstream tasks. We demonstrate through experiments in language modeling, question answering, and summarization that Dodo retains capabilities in these tasks, while drastically reducing the overhead during decoding. For example, in the autoencoding task, Dodo shrinks context at a 20x compression ratio with a BLEU score of 98% for reconstruction, achieving nearly lossless encoding.
LGFeb 27, 2023Code
You Only Transfer What You Share: Intersection-Induced Graph Transfer Learning for Link PredictionWenqing Zheng, Edward W Huang, Nikhil Rao et al.
Link prediction is central to many real-world applications, but its performance may be hampered when the graph of interest is sparse. To alleviate issues caused by sparsity, we investigate a previously overlooked phenomenon: in many cases, a densely connected, complementary graph can be found for the original graph. The denser graph may share nodes with the original graph, which offers a natural bridge for transferring selective, meaningful knowledge. We identify this setting as Graph Intersection-induced Transfer Learning (GITL), which is motivated by practical applications in e-commerce or academic co-authorship predictions. We develop a framework to effectively leverage the structural prior in this setting. We first create an intersection subgraph using the shared nodes between the two graphs, then transfer knowledge from the source-enriched intersection subgraph to the full target graph. In the second step, we consider two approaches: a modified label propagation, and a multi-layer perceptron (MLP) model in a teacher-student regime. Experimental results on proprietary e-commerce datasets and open-source citation graphs show that the proposed workflow outperforms existing transfer learning baselines that do not explicitly utilize the intersection structure.
AIJun 4
Multi-ResNets for Subspace Preconditioning in Constrained OptimizationMerve Karakas, Christopher J. Williams, Emmanuel O. Balogun et al.
We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.
IRJun 14, 2022
Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product SearchChandan K. Reddy, Lluís Màrquez, Fran Valero et al.
Improving the quality of search results can significantly enhance users experience and engagement with search engines. In spite of several recent advancements in the fields of machine learning and data mining, correctly classifying items for a particular user search query has been a long-standing challenge, which still has a large room for improvement. This paper introduces the "Shopping Queries Dataset", a large dataset of difficult Amazon search queries and results, publicly released with the aim of fostering research in improving the quality of search results. The dataset contains around 130 thousand unique queries and 2.6 million manually labeled (query,product) relevance judgements. The dataset is multilingual with queries in English, Japanese, and Spanish. The Shopping Queries Dataset is being used in one of the KDDCup'22 challenges. In this paper, we describe the dataset and present three evaluation tasks along with baseline results: (i) ranking the results list, (ii) classifying product results into relevance categories, and (iii) identifying substitute products for a given query. We anticipate that this data will become the gold standard for future research in the topic of product search.
CLOct 3, 2023
Automatic Pair Construction for Contrastive Post-trainingCanwen Xu, Corby Rosset, Ethan C. Chau et al.
Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we propose an automatic way to construct contrastive data for LLM, using preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continuing SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitioning to "harder" ones, which further improves alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, our automatic contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to outperform ChatGPT.
IRJul 6, 2022
Text Enriched Sparse Hyperbolic Graph Convolutional NetworksNurendra Choudhary, Nikhil Rao, Karthik Subbian et al.
Heterogeneous networks, which connect informative nodes containing text with different edge types, are routinely used to store and process information in various real-world applications. Graph Neural Networks (GNNs) and their hyperbolic variants provide a promising approach to encode such networks in a low-dimensional latent space through neighborhood aggregation and hierarchical feature extraction, respectively. However, these approaches typically ignore metapath structures and the available semantic information. Furthermore, these approaches are sensitive to the noise present in the training data. To tackle these limitations, in this paper, we propose Text Enriched Sparse Hyperbolic Graph Convolution Network (TESH-GCN) to capture the graph's metapath structures using semantic signals and further improve prediction in large heterogeneous graphs. In TESH-GCN, we extract semantic node information, which successively acts as a connection signal to extract relevant nodes' local neighborhood and graph-level metapath features from the sparse adjacency tensor in a reformulated hyperbolic graph convolution layer. These extracted features in conjunction with semantic features from the language model (for robustness) are used for the final downstream task. Experiments on various heterogeneous graph datasets show that our model outperforms the current state-of-the-art approaches by a large margin on the task of link prediction. We also report a reduction in both the training time and model parameters compared to the existing hyperbolic approaches through a reformulated hyperbolic graph convolution. Furthermore, we illustrate the robustness of our model by experimenting with different levels of simulated noise in both the graph structure and text, and also, present a mechanism to explain TESH-GCN's prediction by analyzing the extracted metapaths.
LGOct 29, 2023
Hyperbolic Graph Neural Networks at Scale: A Meta Learning ApproachNurendra Choudhary, Nikhil Rao, Chandan K. Reddy
The progress in hyperbolic neural networks (HNNs) research is hindered by their absence of inductive bias mechanisms, which are essential for generalizing to new tasks and facilitating scalable learning over large datasets. In this paper, we aim to alleviate these issues by learning generalizable inductive biases from the nodes' local subgraph and transfer them for faster learning over new subgraphs with a disjoint set of nodes, edges, and labels in a few-shot setting. We introduce a novel method, Hyperbolic GRAph Meta Learner (H-GRAM), that, for the tasks of node classification and link prediction, learns transferable information from a set of support local subgraphs in the form of hyperbolic meta gradients and label hyperbolic protonets to enable faster learning over a query set of new tasks dealing with disjoint subgraphs. Furthermore, we show that an extension of our meta-learning framework also mitigates the scalability challenges seen in HNNs faced by existing approaches. Our comparative analysis shows that H-GRAM effectively learns and transfers information in multiple challenging few-shot settings compared to other state-of-the-art baselines. Additionally, we demonstrate that, unlike standard HNNs, our approach is able to scale over large graph datasets and improve performance over its Euclidean counterparts.
LGFeb 16, 2022Code
Task-Agnostic Graph ExplanationsYaochen Xie, Sumeet Katariya, Xianfeng Tang et al.
Graph Neural Networks (GNNs) have emerged as powerful tools to encode graph-structured data. Due to their broad applications, there is an increasing need to develop tools to explain how GNNs make decisions given graph-structured data. Existing learning-based GNN explanation approaches are task-specific in training and hence suffer from crucial drawbacks. Specifically, they are incapable of producing explanations for a multitask prediction model with a single explainer. They are also unable to provide explanations in cases where the GNN is trained in a self-supervised manner, and the resulting representations are used in future downstream tasks. To address these limitations, we propose a Task-Agnostic GNN Explainer (TAGE) that is independent of downstream models and trained under self-supervision with no knowledge of downstream tasks. TAGE enables the explanation of GNN embedding models with unseen downstream tasks and allows efficient explanation of multitask models. Our extensive experiments show that TAGE can significantly speed up the explanation efficiency by using the same model to explain predictions for multiple downstream tasks while achieving explanation quality as good as or even better than current state-of-the-art GNN explanation approaches. Our code is pubicly available as part of the DIG library at https://github.com/divelab/DIG/tree/main/dig/xgraph/TAGE/.
LGNov 8, 2021Code
Cold Brew: Distilling Graph Node Representations with Incomplete or Missing NeighborhoodsWenqing Zheng, Edward W Huang, Nikhil Rao et al.
Graph Neural Networks (GNNs) have achieved state-of-the-art performance in node classification, regression, and recommendation tasks. GNNs work well when rich and high-quality connections are available. However, their effectiveness is often jeopardized in many real-world graphs in which node degrees have power-law distributions. The extreme case of this situation, where a node may have no neighbors, is called Strict Cold Start (SCS). SCS forces the prediction to rely completely on the node's own features. We propose Cold Brew, a teacher-student distillation approach to address the SCS and noisy-neighbor challenges for GNNs. We also introduce feature contribution ratio (FCR), a metric to quantify the behavior of inductive GNNs to solve SCS. We experimentally show that FCR disentangles the contributions of different graph data components and helps select the best architecture for SCS generalization. We further demonstrate the superior performance of Cold Brew on several public benchmark and proprietary e-commerce datasets, where many nodes have either very few or noisy connections. Our source code is available at https://github.com/amazon-research/gnn-tail-generalization.
CLFeb 27, 2024
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web AgentsCorby Rosset, Ho-Lam Chung, Guanghui Qin et al. · microsoft-research
Existing question answering (QA) datasets are no longer challenging to most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly study ``known unknowns'' with clear indications of both what information is missing, and how to find it to answer the question. Hence, good performance on these benchmarks provides a false sense of security. A yet unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs, i.e. ``unknown uknowns''. We claim we can find such questions in search engine logs, which is surprising because most question-intent queries are indeed factoid. We present Researchy Questions, a dataset of search engine queries tediously filtered to be non-factoid, ``decompositional'' and multi-perspective. We show that users spend a lot of ``effort'' on these questions in terms of signals like clicks and session length, and that they are also challenging for GPT-4. We also show that ``slow thinking'' answering techniques, like decomposition into sub-questions shows benefit over answering directly. We release $\sim$ 100k Researchy Questions, along with the Clueweb22 URLs that were clicked.
LGMay 17, 2023
Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model AggregationJiong Zhu, Aishwarya Reganti, Edward Huang et al.
Distributed training of GNNs enables learning on massive graphs (e.g., social and e-commerce networks) that exceed the storage and computational capacity of a single machine. To reach performance comparable to centralized training, distributed frameworks focus on maximally recovering cross-instance node dependencies with either communication across instances or periodic fallback to centralized training, which create overhead and limit the framework scalability. In this work, we present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations, and has improved scalability, convergence speed and performance over the state-of-the-art approaches. Specifically, our framework (1) assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph, and (2) only conducts periodic (time-based) model aggregation to synchronize the local models. Backed by our theoretical analysis, instead of maximizing the recovery of cross-instance node dependencies -- which has been considered the key behind closing the performance gap between model aggregation and centralized training -- , our framework leverages randomized assignment of nodes or super-nodes (i.e., collections of original nodes) to partition the training graph such that it improves data uniformity and minimizes the discrepancy of gradient and loss function across instances. In our experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches -- despite using less training data -- achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline, and show better robustness to trainer failures.
LGOct 26, 2021
Cluster-and-Conquer: A Framework For Time-Series ForecastingReese Pathak, Rajat Sen, Nikhil Rao et al.
We propose a three-stage framework for forecasting high-dimensional time-series data. Our method first estimates parameters for each univariate time series. Next, we use these parameters to cluster the time series. These clusters can be viewed as multivariate time series, for which we then compute parameters. The forecasted values of a single time series can depend on the history of other time series in the same cluster, accounting for intra-cluster similarity while minimizing potential noise in predictions by ignoring inter-cluster effects. Our framework -- which we refer to as "cluster-and-conquer" -- is highly general, allowing for any time-series forecasting and clustering method to be used in each step. It is computationally efficient and embarrassingly parallel. We motivate our framework with a theoretical analysis in an idealized mixed linear regression setting, where we provide guarantees on the quality of the estimates. We accompany these guarantees with experimental results that demonstrate the advantages of our framework: when instantiated with simple linear autoregressive models, we are able to achieve state-of-the-art results on several benchmark datasets, sometimes outperforming deep-learning-based approaches.
LGOct 26, 2021
Probabilistic Entity Representation Model for Reasoning over Knowledge GraphsNurendra Choudhary, Nikhil Rao, Sumeet Katariya et al.
Logical reasoning over Knowledge Graphs (KGs) is a fundamental technique that can provide efficient querying mechanism over large and incomplete databases. Current approaches employ spatial geometries such as boxes to learn query representations that encompass the answer entities and model the logical operations of projection and intersection. However, their geometry is restrictive and leads to non-smooth strict boundaries, which further results in ambiguous answer entities. Furthermore, previous works propose transformation tricks to handle unions which results in non-closure and, thus, cannot be chained in a stream. In this paper, we propose a Probabilistic Entity Representation Model (PERM) to encode entities as a Multivariate Gaussian density with mean and covariance parameters to capture its semantic position and smooth decision boundary, respectively. Additionally, we also define the closed logical operations of projection, intersection, and union that can be aggregated using an end-to-end objective function. On the logical query reasoning problem, we demonstrate that the proposed PERM significantly outperforms the state-of-the-art methods on various public benchmark KG datasets on standard evaluation metrics. We also evaluate PERM's competence on a COVID-19 drug-repurposing case study and show that our proposed work is able to recommend drugs with substantially better F1 than current methods. Finally, we demonstrate the working of our PERM's query answering process through a low-dimensional visualization of the Gaussian representations.
MLSep 5, 2021
Scalable Feature Selection for (Multitask) Gradient Boosted TreesCuize Han, Nikhil Rao, Daria Sorokina et al.
Gradient Boosted Decision Trees (GBDTs) are widely used for building ranking and relevance models in search and recommendation. Considerations such as latency and interpretability dictate the use of as few features as possible to train these models. Feature selection in GBDT models typically involves heuristically ranking the features by importance and selecting the top few, or by performing a full backward feature elimination routine. On-the-fly feature selection methods proposed previously scale suboptimally with the number of features, which can be daunting in high dimensional settings. We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions, and enjoys favorable theoretical performance and computational guarantees. We show via extensive experiments on both public and proprietary datasets that the proposed method offers significant speedups in training time, while being as competitive as existing GBDT methods in terms of model performance metrics. We also extend the method to the multitask setting, allowing the practitioner to select common features across tasks, as well as selecting task-specific features.
LGAug 2, 2021
Maximizing and Satisficing in Multi-armed Bandits with Graph InformationParth K. Thaker, Mohit Malu, Nikhil Rao et al.
Pure exploration in multi-armed bandits has emerged as an important framework for modeling decision-making and search under uncertainty. In modern applications, however, one is often faced with a tremendously large number of options. Even obtaining one observation per option may be too costly rendering traditional pure exploration algorithms ineffective. Fortunately, one often has access to similar relationships amongst the options that can be leveraged. In this paper, we consider the pure exploration problem in stochastic multi-armed bandits where the similarities between the arms are captured by a graph and the rewards may be represented as a smooth signal on this graph. In particular, we consider the problem of finding the arm with the maximum reward (i.e., the maximizing problem) or one with a sufficiently high reward (i.e., the satisficing problem) under this model. We propose novel algorithms \textbf{\algoname{}} (GRaph-based UcB) and $ζ$-\textbf{\algoname{}} for these problems and provide a theoretical characterization of their performance which specifically elicits the benefit of the graph side information. We also prove a lower bound on the data requirement, showing a large class of problems where these algorithms are near-optimal. We complement our theory with experimental results that show the benefit of capitalizing on such side information.
LGDec 23, 2020
Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge GraphsNurendra Choudhary, Nikhil Rao, Sumeet Katariya et al.
Knowledge Graphs (KGs) are ubiquitous structures for information storagein several real-world applications such as web search, e-commerce, social networks, and biology. Querying KGs remains a foundational and challenging problem due to their size and complexity. Promising approaches to tackle this problem include embedding the KG units (e.g., entities and relations) in a Euclidean space such that the query embedding contains the information relevant to its results. These approaches, however, fail to capture the hierarchical nature and semantic information of the entities present in the graph. Additionally, most of these approaches only utilize multi-hop queries (that can be modeled by simple translation operations) to learn embeddings and ignore more complex operations such as intersection and union of simpler queries. To tackle such complex operations, in this paper, we formulate KG representation learning as a self-supervised logical query reasoning problem that utilizes translation, intersection and union queries over KGs. We propose Hyperboloid Embeddings (HypE), a novel self-supervised dynamic reasoning framework, that utilizes positive first-order existential queries on a KG to learn representations of its entities and relations as hyperboloids in a Poincaré ball. HypE models the positive first-order queries as geometrical translation, intersection, and union. For the problem of KG reasoning in real-world datasets, the proposed HypE model significantly outperforms the state-of-the art results. We also apply HypE to an anomaly detection task on a popular e-commerce website product taxonomy as well as hierarchically organized web articles and demonstrate significant performance improvements compared to existing baseline methods. Finally, we also visualize the learned HypE embeddings in a Poincaré ball to clearly interpret and comprehend the representation space.
CLMay 7, 2020
Learning Robust Models for e-Commerce Product SearchThanh V. Nguyen, Nikhil Rao, Karthik Subbian
Showing items that do not match search query intent degrades customer experience in e-commerce. These mismatches result from counterfactual biases of the ranking algorithms toward noisy behavioral signals such as clicks and purchases in the search logs. Mitigating the problem requires a large labeled dataset, which is expensive and time-consuming to obtain. In this paper, we develop a deep, end-to-end model that learns to effectively classify mismatches and to generate hard mismatched examples to improve the classifier. We train the model end-to-end by introducing a latent variable into the cross-entropy loss that alternates between using the real and generated samples. This not only makes the classifier more robust but also boosts the overall ranking performance. Our model achieves a relative gain compared to baselines by over 26% in F-score, and over 17% in Area Under PR curve. On live search traffic, our model gains significant improvement in multiple countries.
LGMay 29, 2019
Graph DNA: Deep Neighborhood Aware Graph Encoding for Collaborative FilteringLiwei Wu, Hsiang-Fu Yu, Nikhil Rao et al.
In this paper, we consider recommender systems with side information in the form of graphs. Existing collaborative filtering algorithms mainly utilize only immediate neighborhood information and have a hard time taking advantage of deeper neighborhoods beyond 1-2 hops. The main caveat of exploiting deeper graph information is the rapidly growing time and space complexity when incorporating information from these neighborhoods. In this paper, we propose using Graph DNA, a novel Deep Neighborhood Aware graph encoding algorithm, for exploiting deeper neighborhood information. DNA encoding computes approximate deep neighborhood information in linear time using Bloom filters, a space-efficient probabilistic data structure and results in a per-node encoding that is logarithmic in the number of nodes in the graph. It can be used in conjunction with both feature-based and graph-regularization-based collaborative filtering algorithms. Graph DNA has the advantages of being memory and time efficient and providing additional regularization when compared to directly using higher order graph information. We conduct experiments on real-world datasets, showing graph DNA can be easily used with 4 popular collaborative filtering algorithms and consistently leads to a performance boost with little computational and memory overhead.
CLJul 6, 2017
A Simple Approach to Learn Polysemous Word EmbeddingsYifan Sun, Nikhil Rao, Weicong Ding
Many NLP applications require disambiguating polysemous words. Existing methods that learn polysemous word vector representations involve first detecting various senses and optimizing the sense-specific embeddings separately, which are invariably more involved than single sense learning methods such as word2vec. Evaluating these methods is also problematic, as rigorous quantitative evaluations in this space is limited, especially when compared with single-sense embeddings. In this paper, we propose a simple method to learn a word representation, given any context. Our method only requires learning the usual single sense representation, and coefficients that can be learnt via a single pass over the data. We propose several new test sets for evaluating word sense induction, relevance detection, and contextual word similarity, significantly supplementing the currently available tests. Results on these and other tests show that while our method is embarrassingly simple, it achieves excellent results when compared to the state of the art models for unsupervised polysemous word representation learning.
MLMay 4, 2017
Matrix Completion via Factorizing PolynomialsVatsal Shah, Nikhil Rao, Weicong Ding
Predicting unobserved entries of a partially observed matrix has found wide applicability in several areas, such as recommender systems, computational biology, and computer vision. Many scalable methods with rigorous theoretical guarantees have been developed for algorithms where the matrix is factored into low-rank components, and embeddings are learned for the row and column entities. While there has been recent research on incorporating explicit side information in the low-rank matrix factorization setting, often implicit information can be gleaned from the data, via higher-order interactions among entities. Such implicit information is especially useful in cases where the data is very sparse, as is often the case in real-world datasets. In this paper, we design a method to learn embeddings in the context of recommendation systems, using the observation that higher powers of a graph transition probability matrix encode the probability that a random walker will hit that node in a given number of steps. We develop a coordinate descent algorithm to solve the resulting optimization, that makes explicit computation of the higher order powers of the matrix redundant, preserving sparsity and making computations efficient. Experiments on several datasets show that our method, that can use higher order information, outperforms methods that only use explicitly available side information, those that use only second-order implicit information and in some cases, methods based on deep neural networks as well.
CLMar 2, 2017
Dynamic Word Embeddings for Evolving Semantic DiscoveryZijun Yao, Yifan Sun, Weicong Ding et al.
Word evolution refers to the changing meanings and associations of words throughout time, as a byproduct of human language evolution. By studying word evolution, we can infer social trends and language constructs over different periods of human history. However, traditional techniques such as word representation learning do not adequately capture the evolving language structure and vocabulary. In this paper, we develop a dynamic statistical model to learn time-aware word vector representation. We propose a model that simultaneously learns time-aware embeddings and solves the resulting "alignment problem". This model is trained on a crawled NYTimes dataset. Additionally, we develop multiple intuitive evaluation strategies of temporal word embeddings. Our qualitative and quantitative tests indicate that our method not only reliably captures this evolution over time, but also consistently outperforms state-of-the-art temporal embedding approaches on both semantic accuracy and alignment quality.
MLMar 13, 2016
On Learning High Dimensional Structured Single Index ModelsNikhil Rao, Ravi Ganti, Laura Balzano et al.
Single Index Models (SIMs) are simple yet flexible semi-parametric models for machine learning, where the response variable is modeled as a monotonic function of a linear combination of features. Estimation in this context requires learning both the feature weights and the nonlinear function that relates features to observations. While methods have been described to learn SIMs in the low dimensional regime, a method that can efficiently learn SIMs in high dimensions, and under general structural assumptions, has not been forthcoming. In this paper, we propose computationally efficient algorithms for SIM inference in high dimensions with structural constraints. Our general approach specializes to sparsity, group sparsity, and low-rank assumptions among others. Experiments show that the proposed method enjoys superior predictive performance when compared to generalized linear models, and achieves results comparable to or better than single layer feedforward neural networks with significantly less computational cost.
MLFeb 19, 2016
Structured Sparse Regression via Greedy Hard-ThresholdingPrateek Jain, Nikhil Rao, Inderjit Dhillon
Several learning applications require solving high-dimensional regression problems where the relevant features belong to a small number of (overlapping) groups. For very large datasets and under standard sparsity constraints, hard thresholding methods have proven to be extremely efficient, but such methods require NP hard projections when dealing with overlapping groups. In this paper, we show that such NP-hard projections can not only be avoided by appealing to submodular optimization, but such methods come with strong theoretical guarantees even in the presence of poorly conditioned data (i.e. say when two features have correlation $\geq 0.99$), which existing analyses cannot handle. These methods exhibit an interesting computation-accuracy trade-off and can be extended to significantly harder problems such as sparse overlapping groups. Experiments on both real and synthetic data validate our claims and demonstrate that the proposed methods are orders of magnitude faster than other greedy and convex relaxation techniques for learning with group-structured sparsity.
LGSep 28, 2015
High-dimensional Time Series Prediction with Missing ValuesHsiang-Fu Yu, Nikhil Rao, Inderjit S. Dhillon
High-dimensional time series prediction is needed in applications as diverse as demand forecasting and climatology. Often, such applications require methods that are both highly scalable, and deal with noisy data in terms of corruptions or missing values. Classical time series methods usually fall short of handling both these issues. In this paper, we propose to adapt matrix matrix completion approaches that have previously been successfully applied to large scale noisy data, but which fail to adequately model high-dimensional time series due to temporal dependencies. We present a novel temporal regularized matrix factorization (TRMF) framework which supports data-driven temporal dependency learning and enables forecasting ability to our new matrix factorization approach. TRMF is highly general, and subsumes many existing matrix factorization approaches for time series data. We make interesting connections to graph regularized matrix factorization methods in the context of learning the dependencies. Experiments on both real and synthetic data show that TRMF outperforms several existing approaches for common time series tasks.
MLJun 30, 2015
Learning Single Index Models in High DimensionsRavi Ganti, Nikhil Rao, Rebecca M. Willett et al.
Single Index Models (SIMs) are simple yet flexible semi-parametric models for classification and regression. Response variables are modeled as a nonlinear, monotonic function of a linear combination of features. Estimation in this context requires learning both the feature weights, and the nonlinear function. While methods have been described to learn SIMs in the low dimensional regime, a method that can efficiently learn SIMs in high dimensions has not been forthcoming. We propose three variants of a computationally and statistically efficient algorithm for SIM inference in high dimensions. We establish excess risk bounds for the proposed algorithms and experimentally validate the advantages that our SIM learning methods provide relative to Generalized Linear Model (GLM) and low dimensional SIM based learning methods.
MLMay 15, 2015
Optimal Low-Rank Tensor Recovery from Separable Measurements: Four Contractions SufficeParikshit Shah, Nikhil Rao, Gongguo Tang
Tensors play a central role in many modern machine learning and signal processing applications. In such applications, the target tensor is usually of low rank, i.e., can be expressed as a sum of a small number of rank one tensors. This motivates us to consider the problem of low rank tensor recovery from a class of linear measurements called separable measurements. As specific examples, we focus on two distinct types of separable measurement mechanisms (a) Random projections, where each measurement corresponds to an inner product of the tensor with a suitable random tensor, and (b) the completion problem where measurements constitute revelation of a random set of entries. We present a computationally efficient algorithm, with rigorous and order-optimal sample complexity results (upto logarithmic factors) for tensor recovery. Our method is based on reduction to matrix completion sub-problems and adaptation of Leurgans' method for tensor decomposition. We extend the methodology and sample complexity results to higher order tensors, and experimentally validate our theoretical results.
DSApr 23, 2014
Forward - Backward Greedy Algorithms for Atomic Norm RegularizationNikhil Rao, Parikshit Shah, Stephen Wright
In many signal processing applications, the aim is to reconstruct a signal that has a simple representation with respect to a certain basis or frame. Fundamental elements of the basis known as "atoms" allow us to define "atomic norms" that can be used to formulate convex regularizations for the reconstruction problem. Efficient algorithms are available to solve these formulations in certain special cases, but an approach that works well for general atomic norms, both in terms of speed and reconstruction accuracy, remains to be found. This paper describes an optimization algorithm called CoGEnT that produces solutions with succinct atomic representations for reconstruction problems, generally formulated with atomic-norm constraints. CoGEnT combines a greedy selection scheme based on the conditional gradient approach with a backward (or "truncation") step that exploits the quadratic nature of the objective to reduce the basis size. We establish convergence properties and validate the algorithm via extensive numerical experiments on a suite of signal processing applications. Our algorithm and analysis also allow for inexact forward steps and for occasional enhancements of the current representation to be performed. CoGEnT can outperform the basic conditional gradient method, and indeed many methods that are tailored to specific applications, when the enhancement and truncation steps are defined appropriately. We also introduce several novel applications that are enabled by the atomic-norm framework, including tensor completion, moment problems in signal processing, and graph deconvolution.
LGFeb 18, 2014
Classification with Sparse Overlapping GroupsNikhil Rao, Robert Nowak, Christopher Cox et al.
Classification with a sparsity constraint on the solution plays a central role in many high dimensional machine learning applications. In some cases, the features can be grouped together so that entire subsets of features can be selected or not selected. In many applications, however, this can be too restrictive. In this paper, we are interested in a less restrictive form of structured sparse feature selection: we assume that while features can be grouped according to some notion of similarity, not all features in a group need be selected for the task at hand. When the groups are comprised of disjoint sets of features, this is sometimes referred to as the "sparse group" lasso, and it allows for working with a richer class of models than traditional group lasso methods. Our framework generalizes conventional sparse group lasso further by allowing for overlapping groups, an additional flexiblity needed in many applications and one that presents further challenges. The main contribution of this paper is a new procedure called Sparse Overlapping Group (SOG) lasso, a convex optimization program that automatically selects similar features for classification in high dimensions. We establish model selection error bounds for SOGlasso classification problems under a fairly general setting. In particular, the error bounds are the first such results for classification using the sparse group lasso. Furthermore, the general SOGlasso bound specializes to results for the lasso and the group lasso, some known and some new. The SOGlasso is motivated by multi-subject fMRI studies in which functional activity is classified using brain voxels as features, source localization problems in Magnetoencephalography (MEG), and analyzing gene activation patterns in microarray data analysis. Experiments with real and synthetic data demonstrate the advantages of SOGlasso compared to the lasso and group lasso.
LGNov 20, 2013
Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI AnalysisNikhil Rao, Christopher Cox, Robert Nowak et al.
Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization that automatically selects similar features for related learning tasks. Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular, SOSlasso is motivated by multi- subject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso.
MLSep 14, 2012
Signal Recovery in Unions of Subspaces with Applications to Compressive ImagingNikhil Rao, Benjamin Recht, Robert Nowak
In applications ranging from communications to genetics, signals can be modeled as lying in a union of subspaces. Under this model, signal coefficients that lie in certain subspaces are active or inactive together. The potential subspaces are known in advance, but the particular set of subspaces that are active (i.e., in the signal support) must be learned from measurements. We show that exploiting knowledge of subspaces can further reduce the number of measurements required for exact signal recovery, and derive universal bounds for the number of measurements needed. The bound is universal in the sense that it only depends on the number of subspaces under consideration, and their orientation relative to each other. The particulars of the subspaces (e.g., compositions, dimensions, extents, overlaps, etc.) does not affect the results we obtain. In the process, we derive sample complexity bounds for the special case of the group lasso with overlapping groups (the latent group lasso), which is used in a variety of applications. Finally, we also show that wavelet transform coefficients of images can be modeled as lying in groups, and hence can be efficiently recovered using group lasso methods.