Ofer Meshi

LG
h-index15
17papers
400citations
Novelty56%
AI Score53

17 Papers

CLFeb 18
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi, Krisztian Balog, Sally Goldman et al. · amazon-science, deepmind

The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

IRSep 26, 2024
Minimizing Live Experiments in Recommender Systems: User Simulation to Evaluate Preference Elicitation Policies

Chih-Wei Hsu, Martin Mladenov, Ofer Meshi et al.

Evaluation of policies in recommender systems typically involves A/B testing using live experiments on real users to assess a new policy's impact on relevant metrics. This ``gold standard'' comes at a high cost, however, in terms of cycle time, user cost, and potential user retention. In developing policies for ``onboarding'' new users, these costs can be especially problematic, since on-boarding occurs only once. In this work, we describe a simulation methodology used to augment (and reduce) the use of live experiments. We illustrate its deployment for the evaluation of ``preference elicitation'' algorithms used to onboard new users of the YouTube Music platform. By developing counterfactually robust user behavior models, and a simulation service that couples such models with production infrastructure, we are able to test new algorithms in a way that reliably predicts their performance on key metrics when deployed live. We describe our domain, our simulation models and platform, results of experiments and deployment, and suggest future steps needed to further realistic simulation as a powerful complement to live experiments.

LGJan 25, 2023
Overcoming Prior Misspecification in Online Learning to Rank

Javad Azizi, Ofer Meshi, Masrour Zoghi et al.

The recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.

IRApr 15
A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

Julian Killingback, Ofer Meshi, Henry Li et al.

Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve disk space. In this work, we propose a unified model that compresses the RAG context and utilizes the same representations for retrieval. This approach minimizes disk utilization compared to using separate representations, while significantly reducing the context size required for generation. With an average of 1/10 of the context, our model matches the performance of a traditional RAG reader without increasing storage requirements compared to a multi-vector retrieval model. This approach represents the first model to unify retrieval and context compression using a shared model and representation. We believe this work will inspire further consolidation of distinct models to optimize on-device performance.

AIMay 12
Controllable User Simulation

Guy Tennenholtz, Ofer Meshi, Amir Globerson et al.

Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.

AIOct 13, 2025
Asking Clarifying Questions for Preference Elicitation With Large Language Models

Ali Montazeralghaem, Guy Tennenholtz, Craig Boutilier et al.

Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history. One way to get more information is to present clarifying questions to the user. However, generating effective sequential clarifying questions across various domains remains a challenge. To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences. Our method follows a two-stage process inspired by diffusion models. Starting from a user profile, the forward process generates clarifying questions to obtain answers and then removes those answers step by step, serving as a way to add ``noise'' to the user profile. The reverse process involves training a model to ``denoise'' the user profile by learning to ask effective clarifying questions. Our results show that our method significantly improves the LLM's proficiency in asking funnel questions and eliciting user preferences effectively.

LGMay 29, 2019
Advantage Amplification in Slowly Evolving Latent-State Environments

Martin Mladenov, Ofer Meshi, Jayden Ooi et al.

Latent-state environments with long horizons, such as those faced by recommender systems, pose significant challenges for reinforcement learning (RL). In this work, we identify and analyze several key hurdles for RL in such environments, including belief state error and small action advantage. We develop a general principle of advantage amplification that can overcome these hurdles through the use of temporal abstraction. We propose several aggregation methods and prove they induce amplification in certain settings. We also bound the loss in optimality incurred by our methods in environments where latent state evolves slowly and demonstrate their performance empirically in a stylized user-modeling task.

LGApr 4, 2019
Empirical Bayes Regret Minimization

Chih-Wei Hsu, Branislav Kveton, Ofer Meshi et al.

Most bandit algorithm designs are purely theoretical. Therefore, they have strong regret guarantees, but also are often too conservative in practice. In this work, we pioneer the idea of algorithm design by minimizing the empirical Bayes regret, the average regret over problem instances sampled from a known distribution. We focus on a tractable instance of this problem, the confidence interval and posterior width tuning, and propose an efficient algorithm for solving it. The tuning algorithm is analyzed and evaluated in multi-armed, linear, and generalized linear bandits. We report several-fold reductions in Bayes regret for state-of-the-art bandit algorithms, simply by optimizing over a small sample from a distribution.

LGNov 1, 2018
Deep Structured Prediction with Nonlinear Output Transformations

Colin Graber, Ofer Meshi, Alexander Schwing

Deep structured models are widely used for tasks like semantic segmentation, where explicit correlations between variables provide important prior information which generally helps to reduce the data needs of deep nets. However, current deep structured models are restricted by oftentimes very local neighborhood structure, which cannot be increased for computational complexity reasons, and by the fact that the output configuration, or a representation thereof, cannot be transformed further. Very recent approaches which address those issues include graphical model inference inside deep nets so as to permit subsequent non-linear output space transformations. However, optimization of those formulations is challenging and not well understood. Here, we develop a novel model which generalizes existing approaches, such as structured prediction energy networks, and discuss a formulation which maintains applicability of existing inference techniques.

IROct 4, 2018
Seq2Slate: Re-ranking and Slate Optimization with RNNs

Irwan Bello, Sayali Kulkarni, Sagar Jain et al.

Ranking is a central task in machine learning and information retrieval. In this task, it is especially important to present the user with a slate of items that is appealing as a whole. This in turn requires taking into account interactions between items, since intuitively, placing an item on the slate affects the decision of which other items should be placed alongside it. In this work, we propose a sequence-to-sequence model for ranking called seq2slate. At each step, the model predicts the next `best' item to place on the slate given the items already selected. The sequential nature of the model allows complex dependencies between the items to be captured directly in a flexible and scalable way. We show how to learn the model end-to-end from weak supervision in the form of easily obtained click-through data. We further demonstrate the usefulness of our approach in experiments on standard ranking benchmarks as well as in a real-world recommendation system.

AIMay 7, 2018
Planning and Learning with Stochastic Action Sets

Craig Boutilier, Alon Cohen, Amit Daniely et al.

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities; and a sampling-based model for unknown distributions. We develop poly-time value and policy iteration methods for both cases; and in the first, we offer a poly-time linear programming solution.

OCMay 20, 2016
Linear-memory and Decomposition-invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes

Dan Garber, Ofer Meshi

Recently, several works have shown that natural modifications of the classical conditional gradient method (aka Frank-Wolfe algorithm) for constrained convex optimization, provably converge with a linear rate when: i) the feasible set is a polytope, and ii) the objective is smooth and strongly-convex. However, all of these results suffer from two significant shortcomings: large memory requirement due to the need to store an explicit convex decomposition of the current iterate, and as a consequence, large running-time overhead per iteration, and worst case convergence rate that depends unfavorably on the dimension. In this work we present a new conditional gradient variant and a corresponding analysis that improves on both of the above shortcomings. In particular: both memory and computation overheads are only linear in the dimension. Moreover, in case the optimal solution is sparse, the new convergence rate replaces a factor which is at least linear in the dimension in previous works, with a linear dependence on the number of non-zeros in the optimal solution. At the heart of our method, and corresponding analysis, is a novel way to compute decomposition-invariant away-steps. While our theoretical guarantees do not apply to any polytope, they apply to several important structured polytopes that capture central concepts such as paths in graphs, perfect matchings in bipartite graphs, marginal distributions that arise in structured prediction tasks, and more. Our theoretical findings are complemented by empirical evidence which shows that our method delivers state-of-the-art performance.

MLNov 4, 2015
Train and Test Tightness of LP Relaxations in Structured Prediction

Ofer Meshi, Mehrdad Mahdavi, Adrian Weller et al.

Structured prediction is used in areas such as computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation to the striking observation that approximations based on linear programming (LP) relaxations are often tight on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.

LGOct 20, 2015
Fast and Scalable Structural SVM with Slack Rescaling

Heejin Choi, Ofer Meshi, Nathan Srebro

We present an efficient method for training slack-rescaled structural SVM. Although finding the most violating label in a margin-rescaled formulation is often easy since the target function decomposes with respect to the structure, this is not the case for a slack-rescaled formulation, and finding the most violated label might be very difficult. Our core contribution is an efficient method for finding the most-violating-label in a slack-rescaled formulation, given an oracle that returns the most-violating-label in a (slightly modified) margin-rescaled formulation. We show that our method enables accurate and scalable training for slack-rescaled SVMs, reducing runtime by an order of magnitude compared to previous approaches to slack-rescaled SVMs.

LGSep 26, 2013
Learning Max-Margin Tree Predictors

Ofer Meshi, Elad Eban, Gal Elidan et al.

Structured prediction is a powerful framework for coping with joint prediction of interacting outputs. A central difficulty in using this framework is that often the correct label dependence structure is unknown. At the same time, we would like to avoid an overly complex structure that will lead to intractable prediction. In this work we address the challenge of learning tree structured predictive models that achieve high accuracy while at the same time facilitate efficient (linear time) inference. We start by proving that this task is in general NP-hard, and then suggest an approximate alternative. Briefly, our CRANK approach relies on a novel Circuit-RANK regularizer that penalizes non-tree structures and that can be optimized using a CCCP procedure. We demonstrate the effectiveness of our approach on several domains and show that, despite the relative simplicity of the structure, prediction accuracy is competitive with a fully connected model that is computationally costly at prediction time.

AIJun 20, 2012
Template Based Inference in Symmetric Relational Markov Random Fields

Ariel Jaimovich, Ofer Meshi, Nir Friedman

Relational Markov Random Fields are a general and flexible framework for reasoning about the joint distribution over attributes of a large number of interacting entities. The main computational difficulty in learning such models is inference. Even when dealing with complete data, where one can summarize a large domain by sufficient statistics, learning requires one to compute the expectation of the sufficient statistics given different parameter choices. The typical solution to this problem is to resort to approximate inference procedures, such as loopy belief propagation. Although these procedures are quite efficient, they still require computation that is on the order of the number of interactions (or features) in the model. When learning a large relational model over a complex domain, even such approximations require unrealistic running time. In this paper we show that for a particular class of relational MRFs, which have inherent symmetry, we can perform the inference needed for learning procedures using a template-level belief propagation. This procedure's running time is proportional to the size of the relational model rather than the size of the domain. Moreover, we show that this computational procedure is equivalent to sychronous loopy belief propagation. This enables a dramatic speedup in inference and learning time. We use this procedure to learn relational MRFs for capturing the joint distribution of large protein-protein interaction networks.

AIMay 9, 2012
Convexifying the Bethe Free Energy

Ofer Meshi, Ariel Jaimovich, Amir Globerson et al.

The introduction of loopy belief propagation (LBP) revitalized the application of graphical models in many domains. Many recent works present improvements on the basic LBP algorithm in an attempt to overcome convergence and local optima problems. Notable among these are convexified free energy approximations that lead to inference procedures with provable convergence and quality properties. However, empirically LBP still outperforms most of its convex variants in a variety of settings, as we also demonstrate here. Motivated by this fact we seek convexified free energies that directly approximate the Bethe free energy. We show that the proposed approximations compare favorably with state-of-the art convex free energy approximations.