65.4STMay 8
Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond GaussiansJane H. Lee, Anay Mehrotra, Manolis Zampetakis
We study the estimation of distributional parameters when samples are shown only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis, Tzamos, and Zampetakis (FOCS'19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$ time algorithm for finding $\varepsilon$-accurate parameters for the special case of Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT'24) showed that this exponential dependence on $1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved classes. These works leave the following open problems which we address in this work: Can we estimate the parameters of any Gaussian or even extend beyond Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$ time algorithms when $S$ is a simple set such as a halfspace? We make progress on both of these questions by providing the following results: 1. Toward the first question, we give a $d^{\mathrm{poly}(\ell/\varepsilon)}$ time algorithm for any exponential family that satisfies some structural assumptions and any unknown set $S$ that is $\varepsilon$-approximable by degree-$\ell$ polynomials. This result has two important applications: 1a) The first algorithm for estimating arbitrary Gaussian distributions from samples truncated to an unknown $S$; and 1b) The first algorithm for linear regression with unknown truncation and Gaussian features. 2. To address the second question, we provide an algorithm with runtime $\mathrm{poly}(d/\varepsilon)$ that works for a set of exponential families (containing all Gaussians) when $S$ is a halfspace or an axis-aligned rectangle. Along the way, we develop tools that may be of independent interest, including, a reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts.
63.6MLMay 8
A Note on Non-Negative $L_1$-Approximating PolynomialsJane H. Lee, Anay Mehrotra, Manolis Zampetakis
$L_1$-Approximating polynomials, i.e., polynomials that approximate indicator functions in $L_1$-norm under certain distributions, are widely used in computational learning theory. We study the existence of \textit{non-negative} $L_1$-approximating polynomials with respect to Gaussian distributions. This is a stronger requirement than $L_1$-approximation but weaker than sandwiching polynomials (which themselves have many applications). These non-negative approximating polynomials have recently found uses in smoothed learning from positive-only examples. In this short note, we prove that every class of sets with Gaussian surface area (GSA) at most $Γ$ under the standard Gaussian admits degree-$k$ non-negative polynomials that $\eps$-approximate its indicator functions in $L_1$-norm, for $k=\tilde{O}(Γ^2/\varepsilon^2)$. Equivalently, finite GSA implies $L_1$-approximation with the stronger pointwise guarantee that the approximating polynomial has range contained in $[0,\infty)$. Up to a constant-factor, this matches the degree of the best currently known Gaussian $L_1$-approximation degree bound without the non-negativity constraint.
MLApr 14, 2025
Learning with Positive and Imperfect Unlabeled DataJane H. Lee, Anay Mehrotra, Manolis Zampetakis
We study the problem of learning binary classifiers from positive and unlabeled data when the unlabeled data distribution is shifted, which we call Positive and Imperfect Unlabeled (PIU) Learning. In the absence of covariate shifts, i.e., with perfect unlabeled data, Denis (1998) reduced this problem to learning under Massart noise; however, that reduction fails under even slight shifts. Our main results on PIU learning are the characterizations of the sample complexity of PIU learning and a computationally and sample-efficient algorithm achieving a misclassification error $\varepsilon$. We further show that our results lead to new algorithms for several related problems. 1. Learning from smooth distributions: We give algorithms that learn interesting concept classes from only positive samples under smooth feature distributions, bypassing known existing impossibility results and contributing to recent advances in smoothened learning (Haghtalab et al, J.ACM'24) (Chandrasekaran et al., COLT'24). 2. Learning with a list of unlabeled distributions: We design new algorithms that apply to a broad class of concept classes under the assumption that we are given a list of unlabeled distributions, one of which--unknown to the learner--is $O(1)$-close to the true feature distribution. 3. Estimation in the presence of unknown truncation: We give the first polynomial sample and time algorithm for estimating the parameters of an exponential family distribution from samples truncated to an unknown set approximable by polynomials in $L_1$-norm. This improves the algorithm by Lee et al. (FOCS'24) that requires approximation in $L_2$-norm. 4. Detecting truncation: We present new algorithms for detecting whether given samples have been truncated (or not) for a broad class of non-product distributions, including non-product distributions, improving the algorithm by De et al. (STOC'24).
IROct 24, 2025
Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative RecommendersZhimin Chen, Chenyu Zhao, Ka Chun Mo et al.
Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance the model performance. The advent of large language models and sequential modeling techniques, particularly transformer-like architectures, has led to significant advancements recently (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges on latency, queries per second (QPS) and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely VIrtual Sequential Target Attention (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens; followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in storage system and then utilized as sequence features for downstream model training and inference. This novel design for scalability enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry. Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industry leading recommendation platform serving billions of users.
LGOct 23, 2025
Risk-Averse Constrained Reinforcement Learning with Optimized Certainty EquivalentsJane H. Lee, Baturay Saglam, Spyridon Pougkakiotis et al.
Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed though the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework ensures an exact equivalent to the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm under common assumptions, and verify the risk-aware properties of our approach through several numerical experiments.