49.2MLMay 7
Dynamic Treatment on NetworksBengusu Nar, Jiguang Li, Veronika Ročková et al.
In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a three-stage pipeline that (i) estimates network adoption dynamics via a Bayesian dynamic Ising model from a single observed panel, (ii) augments treatment adoption histories with continuous posterior latent states, and (iii) learns a dynamic policy via offline reinforcement learning. The Bayesian mechanism enables uncertainty quantification over dynamic decisions, yielding posterior ensemble policies with interpretable spillover estimates. We provide a finite-sample regret upper bound that decomposes into standard offline-RL uncertainty, network abstraction error, and first stage error in Ising state estimation. We apply our method to data from Indian village microfinance networks and synthetic stochastic block models under simulated heterogeneous susceptible-infected-susceptible (SIS) dynamics and demonstrate that adaptive targeting outperforms static centrality benchmarks.
LGMay 2, 2025
Stabilizing Temporal Difference Learning via Implicit Stochastic RecursionHwanwoo Kim, Panos Toulis, Eric Laber
Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse for applied RL as well as a building block for more complex and specialized algorithms. However, despite its widespread use, TD procedures are generally sensitive to step size specification. A poor choice of step size can dramatically increase variance and slow convergence in both on-policy and off-policy evaluation tasks. In practice, researchers use trial and error to identify stable step sizes, but these approaches tend to be ad hoc and inefficient. As an alternative, we propose implicit TD algorithms that reformulate TD updates into fixed point equations. Such updates are more stable and less sensitive to step size without sacrificing computational efficiency. Moreover, we derive asymptotic convergence guarantees and finite-time error bounds for our proposed implicit TD algorithms, which include implicit TD(0), TD($λ$), and TD with gradient correction (TDC). Our results show that implicit TD algorithms are applicable to a much broader range of step sizes, and thus provide a robust and versatile framework for policy evaluation and value approximation in modern RL tasks. We demonstrate these benefits empirically through extensive numerical examples spanning both on-policy and off-policy tasks.
OCNov 11, 2021
Convergence and Stability of the Stochastic Proximal Point Algorithm with MomentumJunhyung Lyle Kim, Panos Toulis, Anastasios Kyrillidis
Stochastic gradient descent with momentum (SGDM) is the dominant algorithm in many optimization scenarios, including convex optimization instances and non-convex neural network training. Yet, in the stochastic setting, momentum interferes with gradient noise, often leading to specific step size and momentum choices in order to guarantee convergence, set aside acceleration. Proximal point methods, on the other hand, have gained much attention due to their numerical stability and elasticity against imperfect tuning. Their stochastic accelerated variants though have received limited attention: how momentum interacts with the stability of (stochastic) proximal point methods remains largely unstudied. To address this, we focus on the convergence and stability of the stochastic proximal point algorithm with momentum (SPPAM), and show that SPPAM allows a faster linear convergence to a neighborhood compared to the stochastic proximal point algorithm (SPPA) with a better contraction factor, under proper hyperparameter tuning. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM, allowing a wider range of step size and momentum that lead to convergence.
MEJun 14, 2021
Robust Inference for High-Dimensional Linear Models via Residual RandomizationY. Samuel Wang, Si Kai Lee, Panos Toulis et al.
We propose a residual randomization procedure designed for robust Lasso-based inference in the high-dimensional setting. Compared to earlier work that focuses on sub-Gaussian errors, the proposed procedure is designed to work robustly in settings that also include heavy-tailed covariates and errors. Moreover, our procedure can be valid under clustered errors, which is important in practice, but has been largely overlooked by earlier work. Through extensive simulations, we illustrate our method's wider range of applicability as suggested by theory. In particular, we show that our method outperforms state-of-art methods in challenging, yet more realistic, settings where the distribution of covariates is heavy-tailed or the sample size is small, while it remains competitive in standard, "well behaved" settings previously studied in the literature.
MEAug 12, 2019
Asymptotic Validity and Finite-Sample Properties of Approximate Randomization TestsPanos Toulis
Randomization tests rely on simple data transformations and possess an appealing robustness property. In addition to being finite-sample valid if the data distribution is invariant under the transformation, these tests can be asymptotically valid under a suitable studentization of the test statistic, even if the invariance does not hold. However, practical implementation often encounters noisy data, resulting in approximate randomization tests that may not be as robust. In this paper, our key theoretical contribution is a non-asymptotic bound on the discrepancy between the size of an approximate randomization test and the size of the original randomization test using noiseless data. This allows us to derive novel conditions for the validity of approximate randomization tests under data invariances, while being able to leverage existing results based on studentization if the invariance does not hold. We illustrate our theory through several examples, including tests of significance in linear regression. Our theory can explain certain aspects of how randomization tests perform in small samples, addressing limitations of prior theoretical results.
MLOct 17, 2017
Convergence diagnostics for stochastic gradient descent with constant step sizeJerry Chee, Panos Toulis
Many iterative procedures in stochastic optimization exhibit a transient phase followed by a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in that region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect such phase transition in the context of stochastic gradient descent with constant learning rate. We present theory and experiments suggesting that the region where the proposed diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing such region. Finally, we suggest an application to speed up convergence of stochastic gradient descent by halving the learning rate each time stationarity is detected. This leads to a new variant of stochastic gradient descent, which in many settings is comparable to state-of-art.
STOct 4, 2015
The Proximal Robbins-Monro MethodPanos Toulis, Thibaut Horel, Edoardo M. Airoldi
The need for parameter estimation with massive datasets has reinvigorated interest in stochastic optimization and iterative estimation procedures. Stochastic approximations are at the forefront of this recent development as they yield procedures that are simple, general, and fast. However, standard stochastic approximations are often numerically unstable. Deterministic optimization, on the other hand, increasingly uses proximal updates to achieve numerical stability in a principled manner. A theoretical gap has thus emerged. While standard stochastic approximations are subsumed by the framework of Robbins and Monro (1951), there is no such framework for stochastic approximations with proximal updates. In this paper, we conceptualize a proximal version of the classical Robbins-Monro procedure. Our theoretical analysis demonstrates that the proposed procedure has important stability benefits over the classical Robbins-Monro procedure, while it retains the best known convergence rates. Exact implementations of the proximal Robbins-Monro procedure are challenging, but we show that approximate implementations lead to procedures that are easy to implement, and still dominate classical procedures by achieving numerical stability, practically without tradeoffs. Moreover, approximate proximal Robbins-Monro procedures can be applied even when the objective cannot be calculated analytically, and so they generalize stochastic proximal procedures currently in use.
COSep 22, 2015
Stochastic gradient descent methods for estimation with large data setsDustin Tran, Panos Toulis, Edoardo M. Airoldi
We develop methods for parameter estimation in settings with large-scale data sets, where traditional methods are no longer tenable. Our methods rely on stochastic approximations, which are computationally efficient as they maintain one iterate as a parameter estimate, and successively update that iterate based on a single data point. When the update is based on a noisy gradient, the stochastic approximation is known as standard stochastic gradient descent, which has been fundamental in modern applications with large data sets. Additionally, our methods are numerically stable because they employ implicit updates of the iterates. Intuitively, an implicit update is a shrinked version of a standard one, where the shrinkage factor depends on the observed Fisher information at the corresponding data point. This shrinkage prevents numerical divergence of the iterates, which can be caused either by excess noise or outliers. Our sgd package in R offers the most extensive and robust implementation of stochastic gradient descent methods. We demonstrate that sgd dominates alternative software in runtime for several estimation problems with massive data sets. Our applications include the wide class of generalized linear models as well as M-estimation for robust regression.
MEMay 10, 2015
Towards stability and optimality in stochastic gradient descentPanos Toulis, Dustin Tran, Edoardo M. Airoldi
Iterative procedures for parameter estimation based on stochastic gradient descent allow the estimation to scale to massive data sets. However, in both theory and practice, they suffer from numerical instability. Moreover, they are statistically inefficient as estimators of the true parameter value. To address these two issues, we propose a new iterative procedure termed averaged implicit SGD (AI-SGD). For statistical efficiency, AI-SGD employs averaging of the iterates, which achieves the optimal Cramér-Rao bound under strong convexity, i.e., it is an optimal unbiased estimator of the true parameter value. For numerical stability, AI-SGD employs an implicit update at each iteration, which is related to proximal operators in optimization. In practice, AI-SGD achieves competitive performance with other state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal updates, and is simple to implement as it requires fewer tunable hyperparameters than procedures that do employ proximal updates.
MLDec 21, 2014
Implicit Temporal DifferencesAviv Tamar, Panos Toulis, Shie Mannor et al.
In reinforcement learning, the TD($λ$) algorithm is a fundamental policy evaluation method with an efficient online implementation that is suitable for large-scale problems. One practical drawback of TD($λ$) is its sensitivity to the choice of the step-size. It is an empirically well-known fact that a large step-size leads to fast convergence, at the cost of higher variance and risk of instability. In this work, we introduce the implicit TD($λ$) algorithm which has the same function and computational cost as TD($λ$), but is significantly more stable. We provide a theoretical explanation of this stability and an empirical evaluation of implicit TD($λ$) on typical benchmark tasks. Our results show that implicit TD($λ$) outperforms standard TD($λ$) and a state-of-the-art method that automatically tunes the step-size, and thus shows promise for wide applicability.
MEAug 13, 2014
Asymptotic and finite-sample properties of estimators based on stochastic gradientsPanos Toulis, Edoardo M. Airoldi
Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood, in theory. And in practice, avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators for generalized linear models, Cox proportional hazards, M-estimators, in practice, and perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets