Feng Ruan

ML
h-index14
17papers
691citations
Novelty56%
AI Score46

17 Papers

MLAug 21, 2022
FastCPH: Efficient Survival Analysis for Neural Networks

Xuelin Yang, Louis Abraham, Sejin Kim et al.

The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates -- it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose FastCPH, a new method that runs in linear time and supports both the standard Breslow and Efron methods for tied events. We also demonstrate the performance of FastCPH combined with LassoNet, a neural network that provides interpretability through feature sparsity, on survival datasets. The final procedure is efficient, selects useful covariates and outperforms existing CoxPH approaches.

MENov 21, 2024Code
Robust Detection of Watermarks for Large Language Models Under Human Edits

Xiang Li, Feng Ruan, Huiyuan Wang et al.

Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality \textit{adaptively} as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.

LGJan 29
RAPTOR: Ridge-Adaptive Logistic Probes

Ziqi Gao, Yaotian Zhu, Qingcheng Zeng et al.

Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.

STApr 1, 2024
A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules

Xiang Li, Feng Ruan, Huiyuan Wang et al.

Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key -- provided by the LLM to the verifier -- to enable controlling the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks -- one of which has been internally implemented at OpenAI -- and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments.

MLApr 5, 2024
Nonparametric Modern Hopfield Models

Jerry Yao-Chieh Hu, Bo-Yu Chen, Dennis Wu et al.

We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing \textit{sparse-structured} modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue -- connection with transformer attention, fixed point convergence and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks.

MLFeb 17, 2025
On the kernel learning problem

Yang Li, Feng Ruan

The classical kernel ridge regression problem aims to find the best fit for the output $Y$ as a function of the input data $X\in \mathbb{R}^d$, with a fixed choice of regularization term imposed by a given choice of a reproducing kernel Hilbert space, such as a Sobolev space. Here we consider a generalization of the kernel ridge regression problem, by introducing an extra matrix parameter $U$, which aims to detect the scale parameters and the feature variables in the data, and thereby improve the efficiency of kernel ridge regression. This naturally leads to a nonlinear variational problem to optimize the choice of $U$. We study various foundational mathematical aspects of this variational problem, and in particular how this behaves in the presence of multiscale structures in the data.

LGSep 17, 2025
A Compositional Kernel Model for Feature Learning

Feng Ruan, Keli Liu, Michael Jordan

We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.

MLJun 11, 2024
On the Limitation of Kernel Dependence Maximization for Feature Selection

Keli Liu, Feng Ruan

A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features.

MLOct 12, 2021
On the Self-Penalization Phenomenon in Feature Selection

Michael I. Jordan, Keli Liu, Feng Ruan

We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: \begin{equation*} \min_{β, f}~\widehat{\mathbb{E}}[L(Y, f(β^{1/q} \odot X)] + λ_n \|f\|_{\mathcal{H}_q}^2~~\text{subject to}~~β\ge 0, \end{equation*} where $L$ is the loss, $\odot$ is coordinate-wise multiplication and $\mathcal{H}_q$ is the reproducing kernel Hilbert space based on the kernel $k_q(x, x') = h(\|x-x'\|_q^q)$, where $\|\cdot\|_q$ is the $\ell_q$ norm. Using gradient descent to optimize this objective with respect to $β$ leads to exactly sparse stationary points with high probability. The sparsity is achieved without using any of the well-known explicit sparsification techniques such as penalization (e.g., $\ell_1$), early stopping or post-processing (e.g., clipping). As an application, we use this sparsity-inducing mechanism to build algorithms consistent for feature selection.

STJun 17, 2021
Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel

Feng Ruan, Keli Liu, Michael I. Jordan

Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is the objective function of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.

LGDec 14, 2020
Bandit Learning in Decentralized Matching Markets

Lydia T. Liu, Feng Ruan, Horia Mania et al.

We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience. Also, we assume the players have no direct means of communication. This model extends the standard stochastic multi-armed bandit framework to a decentralized multiple player setting with competition. We introduce a new algorithm for this setting that, over a time horizon $T$, attains $\mathcal{O}(\log(T))$ stable regret when preferences of the arms over players are shared, and $\mathcal{O}(\log(T)^2)$ regret when there are no assumptions on the preferences on either side. Moreover, in the setting where a single player may deviate, we show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general.

MENov 24, 2020
A Self-Penalizing Objective Function for Scalable Interaction Detection

Keli Liu, Feng Ruan

We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables. With $p$ variables there are $O(p^s)$ possible order-$s$ interactions making exhaustive search infeasible. It is nonetheless possible to identify the variables involved in interactions with only linear computation cost, $O(p)$. The trick is to maximize a class of parametrized nonparametric dependence measures which we call metric learning objectives; the landscape of these nonconvex objective functions is sensitive to interactions but the objectives themselves do not explicitly model interactions. Three properties make metric learning objectives highly attractive: (a) The stationary points of the objective are automatically sparse (i.e. performs selection) -- no explicit $\ell_1$ penalization is needed. (b) All stationary points of the objective exclude noise variables with high probability. (c) Guaranteed recovery of all signal variables without needing to reach the objective's global maxima or special stationary points. The second and third properties mean that all our theoretical results apply in the practical case where one uses gradient ascent to maximize the metric learning objective. While not all metric learning objectives enjoy good statistical power, we design an objective based on $\ell_1$ kernels that does exhibit favorable power: it recovers (i) main effects with $n \sim \log p$ samples, (ii) hierarchical interactions with $n \sim \log p$ samples and (iii) order-$s$ pure interactions with $n \sim p^{2(s-1)}\log p$ samples.

MLMar 16, 2020
Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Koulik Khamaru, Ashwin Pananjady, Feng Ruan et al.

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.

STNov 5, 2019
The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime

Andrea Montanari, Feng Ruan, Youngtak Sohn et al.

Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that maps the data into linearly separable classes. Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data. We consider a stylized setting in which data $(y_i,{\boldsymbol x}_i)$, $i\le n$ are i.i.d. with ${\boldsymbol x}_i\sim\mathsf{N}({\boldsymbol 0},{\boldsymbol Σ})$ a $p$-dimensional Gaussian feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution depends on a linear combination of the covariates $\langle {\boldsymbol θ}_*,{\boldsymbol x}_i \rangle$. While the Gaussian model might appear extremely simplistic, universality arguments can be used to show that the results derived in this setting also apply to the output of certain nonlinear featurization maps. We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to ψ$, and derive exact expressions for the limiting generalization error. We use this theory to derive two results of independent interest: $(i)$ Sufficient conditions on $({\boldsymbol Σ},{\boldsymbol θ}_*)$ for `benign overfitting' that parallel previously derived conditions in the case of linear regression; $(ii)$ An asymptotically exact expression for the generalization error when max-margin classification is used in conjunction with feature vectors produced by random one-layer neural networks.

MLJul 29, 2019
LassoNet: A Neural Network with Feature Sparsity

Ismael Lemhadri, Feng Ruan, Louis Abraham et al.

Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. On systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

STDec 16, 2016
Asymptotic Optimality in Stochastic Optimization

John Duchi, Feng Ruan

We study local complexity measures for stochastic convex optimization problems, providing a local minimax theory analogous to that of Hájek and Le Cam for classical statistical problems. We give complementary optimality results, developing fully online methods that adaptively achieve optimal convergence guarantees. Our results provide function-specific lower bounds and convergence results that make precise a correspondence between statistical difficulty and the geometric notion of tilt-stability from optimization. As part of this development, we show how variants of Nesterov's dual averaging---a stochastic gradient-based procedure---guarantee finite time identification of constraints in optimization problems, while stochastic gradient procedures fail. Additionally, we highlight a gap between problems with linear and nonlinear constraints: standard stochastic-gradient-based procedures are suboptimal even for the simplest nonlinear constraints, necessitating the development of asymptotically optimal Riemannian stochastic gradient methods.

STJun 30, 2014
Sparse Recovery via Differential Inclusions

Stanley Osher, Feng Ruan, Jiechao Xiong et al.

In this paper, we recover sparse signals from their noisy linear measurements by solving nonlinear differential inclusions, which is based on the notion of inverse scale space (ISS) developed in applied mathematics. Our goal here is to bring this idea to address a challenging problem in statistics, \emph{i.e.} finding the oracle estimator which is unbiased and sign-consistent using dynamics. We call our dynamics \emph{Bregman ISS} and \emph{Linearized Bregman ISS}. A well-known shortcoming of LASSO and any convex regularization approaches lies in the bias of estimators. However, we show that under proper conditions, there exists a bias-free and sign-consistent point on the solution paths of such dynamics, which corresponds to a signal that is the unbiased estimate of the true signal and whose entries have the same signs as those of the true signs, \emph{i.e.} the oracle estimator. Therefore, their solution paths are regularization paths better than the LASSO regularization path, since the points on the latter path are biased when sign-consistency is reached. We also show how to efficiently compute their solution paths in both continuous and discretized settings: the full solution paths can be exactly computed piece by piece, and a discretization leads to \emph{Linearized Bregman iteration}, which is a simple iterative thresholding rule and easy to parallelize. Theoretical guarantees such as sign-consistency and minimax optimal $l_2$-error bounds are established in both continuous and discrete settings for specific points on the paths. Early-stopping rules for identifying these points are given. The key treatment relies on the development of differential inequalities for differential inclusions and their discretizations, which extends the previous results and leads to exponentially fast recovering of sparse signals before selecting wrong ones.