Koby Crammer

h-index40

21papers

445citations

Novelty51%

AI Score27

Ranked #154,776 of 194,257 authors (top 80%)#33,986 in LG (top 85%)

21 Papers

15.1LGMay 28, 2021Code

Weighted Training for Cross-Task Learning

Shuxiao Chen, Koby Crammer, Hangfeng He et al.

In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning.

1.0LGJun 13, 2019

Finite Sample Analysis Of Dynamic Regression Parameter Learning

Mark Kozdoba, Edward Moroshko, Shie Mannor et al.

We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with non-constant observation operator, where the parameters that need to be learned are the variance of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, existing literature does not provide any clues to the following fundamental question: In terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator -- the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub Gaussian distributions of noise terms. We evaluate the approach on synthetic and real-world benchmarks.

0.8LGDec 17, 2018

Multi Instance Learning For Unbalanced Data

Mark Kozdoba, Edward Moroshko, Lior Shani et al.

In the context of Multi Instance Learning, we analyze the Single Instance (SI) learning objective. We show that when the data is unbalanced and the family of classifiers is sufficiently rich, the SI method is a useful learning algorithm. In particular, we show that larger data imbalance, a quality that is typically perceived as negative, in fact implies a better resilience of the algorithm to the statistical dependencies of the objects in bags. In addition, our results shed new light on some known issues with the SI method in the setting of linear classifiers, and we show that these issues are significantly less likely to occur in the setting of neural networks. We demonstrate our results on a synthetic dataset, and on the COCO dataset for the problem of patch classification with weak image level labels derived from captions.

4.7LGMar 28, 2018

A Better Resource Allocation Algorithm with Semi-Bandit Feedback

Yuval Dagan, Koby Crammer

We study a sequential resource allocation problem between a fixed number of arms. On each iteration the algorithm distributes a resource among the arms in order to maximize the expected success rate. Allocating more of the resource to a given arm increases the probability that it succeeds, yet with a cut-off. We follow Lattimore et al. (2014) and assume that the probability increases linearly until it equals one, after which allocating more of the resource is wasteful. These cut-off values are fixed and unknown to the learner. We present an algorithm for this problem and prove a regret upper bound of $O(\log n)$ improving over the best known bound of $O(\log^2 n)$. Lower bounds we prove show that our upper bound is tight. Simulations demonstrate the superiority of our algorithm.

7.5LGMar 8, 2018Code

Efficient Loss-Based Decoding on Graphs For Extreme Classification

Itay Evron, Edward Moroshko, Koby Crammer

In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space, and on a general approach for error correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy. In addition, our framework offers a tradeoff between accuracy, model size and prediction time. We show how to find the sweet spot of this tradeoff using only the training data. Our experimental study demonstrates the validity of our assumptions and claims, and shows that our method is competitive with state-of-the-art algorithms.

20.4MLFeb 23, 2017

Rotting Bandits

Nir Levine, Koby Crammer, Shie Mannor

The Multi-Armed Bandits (MAB) framework highlights the tension between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker must choose an arm at each time step, upon which she receives a reward. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. The MAB problem has been studied extensively, specifically under the assumption of the arms' rewards distributions being stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we termed Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.

1.0LGJan 31, 2016

Bandits meet Computer Architecture: Designing a Smartly-allocated Cache

Yonatan Glassner, Koby Crammer

In many embedded systems, such as imaging sys- tems, the system has a single designated purpose, and same threads are executed repeatedly. Profiling thread behavior, allows the system to allocate each thread its resources in a way that improves overall system performance. We study an online resource al- locationproblem,wherearesourcemanagersimulta- neously allocates resources (exploration), learns the impact on the different consumers (learning) and im- proves allocation towards optimal performance (ex- ploitation). We build on the rich framework of multi- armed bandits and present online and offline algo- rithms. Through extensive experiments with both synthetic data and real-world cache allocation to threads we show the merits and properties of our al- gorithms

10.0LGNov 4, 2015

Learn on Source, Refine on Target:A Model Transfer Learning Framework with Random Forests

Noam Segev, Maayan Harel, Shie Mannor et al.

We propose novel model transfer-learning methods that refine a decision forest model M learned within a "source" domain using a training set sampled from a "target" domain, assumed to be a variation of the source. We present two random forest transfer algorithms. The first algorithm searches greedily for locally optimal modifications of each tree structure by trying to locally expand or reduce the tree around individual nodes. The second algorithm does not modify structure, but only the parameter (thresholds) associated with decision nodes. We also propose to combine both methods by considering an ensemble that contains the union of the two forests. The proposed methods exhibit impressive experimental results over a range of problems.

1.1LGOct 30, 2015

CONQUER: Confusion Queried Online Bandit Learning

Daniel Barsky, Koby Crammer

We present a new recommendation setting for picking out two items from a given set to be highlighted to a user, based on contextual input. These two items are presented to a user who chooses one of them, possibly stochastically, with a bias that favours the item with the higher value. We propose a second-order algorithm framework that members of it use uses relative upper-confidence bounds to trade off exploration and exploitation, and some explore via sampling. We analyze one algorithm in this framework in an adversarial setting with only mild assumption on the data, and prove a regret bound of $O(Q_T + \sqrt{TQ_T\log T} + \sqrt{T}\log T)$, where $T$ is the number of rounds and $Q_T$ is the cumulative approximation error of item values using a linear model. Experiments with product reviews from 33 domains show the advantage of our methods over algorithms designed for related settings, and that UCB based algorithms are inferior to greed or sampling based algorithms.

1.5MLMay 26, 2015

Belief Flows of Robust Online Learning

Pedro A. Ortega, Koby Crammer, Daniel D. Lee

This paper introduces a new probabilistic model for online learning which dynamically incorporates information from stochastic gradients of an arbitrary loss function. Similar to probabilistic filtering, the model maintains a Gaussian belief over the optimal weight parameters. Unlike traditional Bayesian updates, the model incorporates a small number of gradient evaluations at locations chosen using Thompson sampling, making it computationally tractable. The belief is then transformed via a linear flow field which optimally updates the belief distribution using rules derived from information theoretic principles. Several versions of the algorithm are shown using different constraints on the flow field and compared with conventional online learning algorithms. Results are given for several classification tasks including logistic regression and multilayer neural networks.

1.4LGNov 17, 2014

Outlier-Robust Convex Segmentation

Itamar Katz, Koby Crammer

We derive a convex optimization problem for the task of segmenting sequential data, which explicitly treats presence of outliers. We describe two algorithms for solving this problem, one exact and one a top-down novel approach, and we derive a consistency results for the case of two segments and no outliers. Robustness to outliers is evaluated on two real-world tasks related to speech segmentation. Our algorithms outperform baseline segmentation algorithms.

12.6LGJun 15, 2014

Optimal Resource Allocation with Semi-Bandit Feedback

Tor Lattimore, Koby Crammer, Csaba Szepesvári

We study a sequential resource allocation problem involving a fixed number of recurring jobs. At each time-step the manager should distribute available resources among the jobs in order to maximise the expected number of completed jobs. Allocating more resources to a given job increases the probability that it completes, but with a cut-off. Specifically, we assume a linear model where the probability increases linearly until it equals one, after which allocating additional resources is wasteful. We assume the difficulty of each job is unknown and present the first algorithm for this problem and prove upper and lower bounds on its regret. Despite its apparent simplicity, the problem has a rich structure: we show that an appropriate optimistic algorithm can improve its learning speed dramatically beyond the results one normally expects for similar problems as the problem becomes resource-laden.

1.4LGFeb 17, 2014

Selective Sampling with Drift

Edward Moroshko, Koby Crammer

Recently there has been much work on selective sampling, an online active learning setting, in which algorithms work in rounds. On each round an algorithm receives an input and makes a prediction. Then, it can decide whether to query a label, and if so to update its model, otherwise the input is discarded. Most of this work is focused on the stationary case, where it is assumed that there is a fixed target model, and the performance of the algorithm is compared to a fixed model. However, in many real-world applications, such as spam prediction, the best target function may drift over time, or have shifts from time to time. We develop a novel selective sampling algorithm for the drifting setting, analyze it under no assumptions on the mechanism generating the sequence of instances, and derive new mistake bounds that depend on the amount of drift in the problem. Simulations on synthetic and real-world datasets demonstrate the superiority of our algorithms as a selective sampling algorithm in the drifting setting.

7.2LGApr 12, 2013

Advice-Efficient Prediction with Expert Advice

Yevgeny Seldin, Peter Bartlett, Koby Crammer

Advice-efficient prediction with expert advice (in analogy to label-efficient prediction) is a variant of prediction with expert advice game, where on each round of the game we are allowed to ask for advice of a limited number $M$ out of $N$ experts. This setting is especially interesting when asking for advice of every expert on every round is expensive. We present an algorithm for advice-efficient prediction with expert advice that achieves $O(\sqrt{\frac{N}{M}T\ln N})$ regret on $T$ rounds of the game.

20.7LGApr 10, 2013

A Generalized Online Mirror Descent with Applications to Classification and Regression

Francesco Orabona, Koby Crammer, Nicolò Cesa-Bianchi

Online learning algorithms are fast, memory-efficient, easy to implement, and applicable to many prediction problems, including classification, regression, and ranking. Several online algorithms were proposed in the past few decades, some based on additive updates, like the Perceptron, and some on multiplicative updates, like Winnow. A unifying perspective on the design and the analysis of online algorithms is provided by online mirror descent, a general prediction strategy from which most first-order algorithms can be obtained as special cases. We generalize online mirror descent to time-varying regularizers with generic updates. Unlike standard mirror descent, our more general formulation also captures second order algorithms, algorithms for composite losses and algorithms for adaptive filtering. Moreover, we recover, and sometimes improve, known regret bounds as special cases of our analysis using specific regularizers. Finally, we show the power of our approach by deriving a new second order algorithm with a regret bound invariant with respect to arbitrary rescalings of individual features.

4.2LGMar 15, 2013

A Last-Step Regression Algorithm for Non-Stationary Online Learning

Edward Moroshko, Koby Crammer

The goal of a learner in standard online learning is to maintain an average loss close to the loss of the best-performing single function in some class. In many real-world problems, such as rating or ranking items, there is no single best target function during the runtime of the algorithm, instead the best (local) target function is drifting over time. We develop a novel last-step minmax optimal algorithm in context of a drift. We analyze the algorithm in the worst-case regret framework and show that it maintains an average loss close to that of the best slowly changing sequence of linear functions, as long as the total of drift is sublinear. In some situations, our bound improves over existing bounds, and additionally the algorithm suffers logarithmic regret when there is no drift. We also build on the H_infinity filter and its bound, and develop and analyze a second algorithm for drifting setting. Synthetic simulations demonstrate the advantages of our algorithms in a worst-case constant drift setting.

8.0LGMar 1, 2013

Second-Order Non-Stationary Online Learning for Regression

Nina Vaits, Edward Moroshko, Koby Crammer

The goal of a learner, in standard online learning, is to have the cumulative loss not much larger compared with the best-performing function from some fixed class. Numerous algorithms were shown to have this gap arbitrarily close to zero, compared with the best function that is chosen off-line. Nevertheless, many real-world applications, such as adaptive filtering, are non-stationary in nature, and the best prediction function may drift over time. We introduce two novel algorithms for online regression, designed to work well in non-stationary environment. Our first algorithm performs adaptive resets to forget the history, while the second is last-step min-max optimal in context of a drift. We analyze both algorithms in the worst-case regret framework and show that they maintain an average loss close to that of the best slowly changing sequence of linear functions, as long as the cumulative drift is sublinear. In addition, in the stationary case, when no drift occurs, our algorithms suffer logarithmic regret, as for previous algorithms. Our bounds improve over the existing ones, and simulations demonstrate the usefulness of these algorithms compared with other state-of-the-art approaches.

2.9LGJan 25, 2013

Weighted Last-Step Min-Max Algorithm with Improved Sub-Logarithmic Regret

Edward Moroshko, Koby Crammer

In online learning the performance of an algorithm is typically compared to the performance of a fixed function from some class, with a quantity called regret. Forster proposed a last-step min-max algorithm which was somewhat simpler than the algorithm of Vovk, yet with the same regret. In fact the algorithm he analyzed assumed that the choices of the adversary are bounded, yielding artificially only the two extreme cases. We fix this problem by weighing the examples in such a way that the min-max problem will be well defined, and provide analysis with logarithmic regret that may have better multiplicative factor than both bounds of Forster and Vovk. We also derive a new bound that may be sub-logarithmic, as a recent bound of Orabona et.al, but may have better multiplicative factor. Finally, we analyze the algorithm in a weak-type of non-stationary setting, and show a bound that is sub-linear if the non-stationarity is sub-linear as well.

1.4LGSep 27, 2012

More Is Better: Large Scale Partially-supervised Sentiment Classification - Appendix

Yoav Haimovitch, Koby Crammer, Shie Mannor

We describe a bootstrapping algorithm to learn from partially labeled data, and the results of an empirical study for using it to improve performance of sentiment classification using up to 15 million unlabeled Amazon product reviews. Our experiments cover semi-supervised learning, domain adaptation and weakly supervised learning. In some cases our methods were able to reduce test error by more than half using such large amount of data. NOTICE: This is only the supplementary material.

3.8LGJun 27, 2012

Discriminative Learning via Semidefinite Probabilistic Models

Koby Crammer, Amir Globerson

Discriminative linear models are a popular tool in machine learning. These can be generally divided into two types: The first is linear classifiers, such as support vector machines, which are well studied and provide state-of-the-art results. One shortcoming of these models is that their output (known as the 'margin') is not calibrated, and cannot be translated naturally into a distribution over the labels. Thus, it is difficult to incorporate such models as components of larger systems, unlike probabilistic based approaches. The second type of approach constructs class conditional distributions using a nonlinearity (e.g. log-linear models), but is occasionally worse in terms of classification error. We propose a supervised learning method which combines the best of both approaches. Specifically, our method provides a distribution over the labels, which is a linear function of the model parameters. As a consequence, differences between probabilities are linear functions, a property which most probabilistic models (e.g. log-linear) do not have. Our model assumes that classes correspond to linear subspaces (rather than to half spaces). Using a relaxed projection operator, we construct a measure which evaluates the degree to which a given vector 'belongs' to a subspace, resulting in a distribution over labels. Interestingly, this view is closely related to similar concepts in quantum detection theory. The resulting models can be trained either to maximize the margin or to optimize average likelihood measures. The corresponding optimization problems are semidefinite programs which can be solved efficiently. We illustrate the performance of our algorithm on real world datasets, and show that it outperforms 2nd order kernel methods.

2.7LGJun 18, 2012

Adaptive Regularization for Weight Matrices

Koby Crammer, Gal Chechik

Algorithms for learning distributions over weight-vectors, such as AROW were recently shown empirically to achieve state-of-the-art performance at various problems, with strong theoretical guaranties. Extending these algorithms to matrix models pose challenges since the number of free parameters in the covariance of the distribution scales as $n^4$ with the dimension $n$ of the matrix, and $n$ tends to be large in real applications. We describe, analyze and experiment with two new algorithms for learning distribution of matrix models. Our first algorithm maintains a diagonal covariance over the parameters and can handle large covariance matrices. The second algorithm factors the covariance to capture inter-features correlation while keeping the number of parameters linear in the size of the original matrix. We analyze both algorithms in the mistake bound model and show a superior precision performance of our approach over other algorithms in two tasks: retrieving similar images, and ranking similar documents. The factored algorithm is shown to attain faster convergence rate.