Vijay Kamble

LG
9papers
146citations
Novelty54%
AI Score40

9 Papers

LGFeb 10
Causal Identification in Multi-Task Demand Learning with Confounding

Varun Gupta, Vijay Kamble

We study a canonical multi-task demand learning problem motivated by retail pricing, in which a firm seeks to estimate heterogeneous linear price-response functions across a large collection of decision contexts. Each context is characterized by rich observable covariates yet typically exhibits only limited historical price variation, motivating the use of multi-task learning to borrow strength across tasks. A central challenge in this setting is endogeneity: historical prices are chosen by managers or algorithms and may be arbitrarily correlated with unobserved, task-level demand determinants. Under such confounding by latent fundamentals, commonly used approaches, such as pooled regression and meta-learning, fail to identify causal price effects. We propose a new estimation framework that achieves causal identification despite arbitrary dependence between prices and latent task structure. Our approach, Decision-Conditioned Masked-Outcome Meta-Learning (DCMOML), involves carefully designing the information set of a meta-learner to leverage cross-task heterogeneity while accounting for endogenous decision histories. Under a mild restriction on price adaptivity in each task, we establish that this method identifies the conditional mean of the task-specific causal parameters given the designed information set. Our results provide guarantees for large-scale demand estimation with endogenous prices and small per-task samples, offering a principled foundation for deploying causal, data-driven pricing models in operational environments.

LGMar 16, 2021
Algorithmic Challenges in Ensuring Fairness at the Time of Decision

Jad Salem, Swati Gupta, Vijay Kamble

Algorithmic decision-making in societal contexts, such as retail pricing, loan administration, recommendations on online platforms, etc., can be framed as stochastic optimization under bandit feedback, which typically requires experimentation with different decisions for the sake of learning. Such experimentation often results in perceptions of unfairness among people impacted by these decisions; for instance, there have been several recent lawsuits accusing companies that deploy algorithmic pricing practices of price gouging. Motivated by the changing legal landscape surrounding algorithmic decision-making, we introduce the well-studied fairness notion of envy-freeness within the context of stochastic convex optimization. Our notion requires that upon receiving decisions in the present time, groups do not envy the decisions received by any of the other groups, both in the present as well as the past. This results in a novel trajectory-constrained stochastic optimization problem that renders existing techniques inapplicable. The main technical contribution of this work is to show problem settings where there is no gap in achievable regret (up to logarithmic factors) when envy-freeness is imposed. In particular, in our main result, we develop a near-optimal envy-free algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for smooth convex functions that satisfy the PL inequality. This algorithm has a coordinate-descent structure, in which we carefully leverage gradient information to ensure monotonic sampling along each dimension, while avoiding overshooting the constrained optimum with high probability. This latter aspect critically uses smoothness and the structure of the envy-freeness constraints, while the PL inequality allows for sufficient progress towards the optimal solution. We discuss several open questions that arise from this analysis, which may be of independent interest.

LGJun 11, 2020
Maximal Objectives in the Multi-armed Bandit with Applications

Eren Ozbay, Vijay Kamble

In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, motivated by certain operational concerns in online platforms, we consider a new objective in the classical setup. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $Ω(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and a worst-case regret of $Ω(K^{1/3}T^{2/3})$. We then design an adaptive explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top $m$ arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of grooming an adequate supply of value-providing market participants (workers/sellers/service providers) in online platforms.

LGDec 10, 2018
Individual Fairness in Hindsight

Swati Gupta, Vijay Kamble

Since many critical decisions impacting human lives are increasingly being made by algorithms, it is important to ensure that the treatment of individuals under such algorithms is demonstrably fair under reasonable notions of fairness. One compelling notion proposed in the literature is that of individual fairness (IF), which advocates that similar individuals should be treated similarly (Dwork et al. 2012). Originally proposed for offline decisions, this notion does not, however, account for temporal considerations relevant for online decision-making. In this paper, we extend the notion of IF to account for the time at which a decision is made, in settings where there exists a notion of conduciveness of decisions as perceived by the affected individuals. We introduce two definitions: (i) fairness-across-time (FT) and (ii) fairness-in-hindsight (FH). FT is the simplest temporal extension of IF where treatment of individuals is required to be individually fair relative to the past as well as future, while in FH, we require a one-sided notion of individual fairness that is defined relative to only the past decisions. We show that these two definitions can have drastically different implications in the setting where the principal needs to learn the utility model. Linear regret relative to optimal individually fair decisions is inevitable under FT for non-trivial examples. On the other hand, we design a new algorithm: Cautious Fair Exploration (CaFE), which satisfies FH and achieves sub-linear regret guarantees for a broad range of settings. We characterize lower bounds showing that these guarantees are order-optimal in the worst case. FH can thus be embedded as a primary safeguard against unfair discrimination in algorithmic deployments, without hindering the ability to take good decisions in the long-run.

LGSep 18, 2018
Exploration vs. Exploitation in Team Formation

Ramesh Johari, Vijay Kamble, Anilesh K. Krishnaswamy et al.

An online labor platform faces an online learning problem in matching workers with jobs and using the performance on these jobs to create better future matches. This learning problem is complicated by the rise of complex tasks on these platforms, such as web development and product design, that require a team of workers to complete. The success of a job is now a function of the skills and contributions of all workers involved, which may be unknown to both the platform and the client who posted the job. These team matchings result in a structured correlation between what is known about the individuals and this information can be utilized to create better future matches. We analyze two natural settings where the performance of a team is dictated by its strongest and its weakest member, respectively. We find that both problems pose an exploration-exploitation tradeoff between learning the performance of untested teams and repeating previously tested teams that resulted in a good performance. We establish fundamental regret bounds and design near-optimal algorithms that uncover several insights into these tradeoffs.

GTMar 16, 2016
An Approximate Dynamic Programming Approach to Adversarial Online Learning

Vijay Kamble, Patrick Loiseau, Jean Walrand

We describe an approximate dynamic programming (ADP) approach to compute approximations of the optimal strategies and of the minimal losses that can be guaranteed in discounted repeated games with vector-valued losses. Such games prominently arise in the analysis of regret in repeated decision-making in adversarial environments, also known as adversarial online learning. At the core of our approach is a characterization of the lower Pareto frontier of the set of expected losses that a player can guarantee in these games as the unique fixed point of a set-valued dynamic programming operator. When applied to the problem of regret minimization with discounted losses, our approach yields algorithms that achieve markedly improved performance bounds compared to off-the-shelf online learning algorithms like Hedge. These results thus suggest the significant potential of ADP-based approaches in adversarial online learning.

LGMar 15, 2016
Matching while Learning

Ramesh Johari, Vijay Kamble, Yash Kanoria

We consider the problem faced by a service platform that needs to match limited supply with demand but also to learn the attributes of new users in order to match them better in the future. We introduce a benchmark model with heterogeneous "workers" (demand) and a limited supply of "jobs" that arrive over time. Job types are known to the platform, but worker types are unknown and must be learned by observing match outcomes. Workers depart after performing a certain number of jobs. The expected payoff from a match depends on the pair of types and the goal is to maximize the steady-state rate of accumulation of payoff. Though we use terminology inspired by labor markets, our framework applies more broadly to platforms where a limited supply of heterogeneous products is matched to users over time. Our main contribution is a complete characterization of the structure of the optimal policy in the limit that each worker performs many jobs. The platform faces a trade-off for each worker between myopically maximizing payoffs (exploitation) and learning the type of the worker (exploration). This creates a multitude of multi-armed bandit problems, one for each worker, coupled together by the constraint on availability of jobs of different types (capacity constraints). We find that the platform should estimate a shadow price for each job type, and use the payoffs adjusted by these prices, first, to determine its learning goals and then, for each worker, (i) to balance learning with payoffs during the "exploration phase," and (ii) to myopically match after it has achieved its learning goals during the "exploitation phase."

GTJul 25, 2015
The Square Root Agreement Rule for Incentivizing Truthful Feedback on Online Platforms

Vijay Kamble, Nihar Shah, David Marn et al.

A major challenge in obtaining evaluations of products or services on e-commerce platforms is eliciting informative responses in the absence of verifiability. This paper proposes the Square Root Agreement Rule (SRA): a simple reward mechanism that incentivizes truthful responses to objective evaluations on such platforms. In this mechanism, an agent gets a reward for an evaluation only if her answer matches that of her peer, where this reward is inversely proportional to a popularity index of the answer. This index is defined to be the square root of the empirical frequency at which any two agents performing the same evaluation agree on the particular answer across evaluations of similar entities operating on the platform. Rarely agreed-upon answers thus earn a higher reward than answers for which agreements are relatively more common. We show that in the many tasks regime, the truthful equilibrium under SRA is strictly payoff-dominant across large classes of natural equilibria that could arise in these settings, thus increasing the likelihood of its adoption. While there exist other mechanisms achieving such guarantees, they either impose additional assumptions on the response distribution that are not generally satisfied for objective evaluations or they incentivize truthful behavior only if each agent performs a prohibitively large number of evaluations and commits to using the same strategy for each evaluation. SRA is the first known incentive mechanism satisfying such guarantees without imposing any such requirements. Moreover, our empirical findings demonstrate the robustness of the incentive properties of SRA in the presence of mild subjectivity or observational biases in the responses. These properties make SRA uniquely attractive for administering reward-based incentive schemes (e.g., rebates, discounts, reputation scores, etc.) on online platforms.

LGMar 6, 2015
Sequential Relevance Maximization with Binary Feedback

Vijay Kamble, Nadia Fawaz, Fernando Silveira

Motivated by online settings where users can provide explicit feedback about the relevance of products that are sequentially presented to them, we look at the recommendation process as a problem of dynamically optimizing this relevance feedback. Such an algorithm optimizes the fine tradeoff between presenting the products that are most likely to be relevant, and learning the preferences of the user so that more relevant recommendations can be made in the future. We assume a standard predictive model inspired by collaborative filtering, in which a user is sampled from a distribution over a set of possible types. For every product category, each type has an associated relevance feedback that is assumed to be binary: the category is either relevant or irrelevant. Assuming that the user stays for each additional recommendation opportunity with probability $β$ independent of the past, the problem is to find a policy that maximizes the expected number of recommendations that are deemed relevant in a session. We analyze this problem and prove key structural properties of the optimal policy. Based on these properties, we first present an algorithm that strikes a balance between recursion and dynamic programming to compute this policy. We further propose and analyze two heuristic policies: a `farsighted' greedy policy that attains at least $1-β$ factor of the optimal payoff, and a naive greedy policy that attains at least $\frac{1-β}{1+β}$ factor of the optimal payoff in the worst case. Extensive simulations show that these heuristics are very close to optimal in practice.