Vladas Pipiras

LG
3 papers
4 citations
Novelty: 50%
AI Score: 23

3 Papers

LG | Sep 1, 2023
Consistency of Lloyd's Algorithm Under Perturbations

Dhruv Patel, Hui Shen, Shankar Bhamidi et al.

In the context of unsupervised learning, Lloyd's algorithm is one of the most widely used clustering algorithms. It has inspired a plethora of work investigating the correctness of the algorithm under various settings with ground truth clusters. In particular, in 2016, Lu and Zhou showed that the misclustering rate of Lloyd's algorithm on $n$ independent samples from a sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations, assuming proper initialization of the algorithm. However, in many applications, the true samples are unobserved and need to be learned from the data via pre-processing pipelines such as spectral methods on appropriate data matrices. We show that the misclustering rate of Lloyd's algorithm on perturbed samples from a sub-Gaussian mixture is also exponentially bounded after $O(\log(n))$ iterations, assuming proper initialization and a perturbation that is small relative to the sub-Gaussian noise. In canonical settings with ground truth clusters, we derive bounds for algorithms such as $k$-means++ to find good initializations, which leads to the correctness of clustering via the main result. We show the implications of the results for pipelines that measure the statistical significance of clusters derived from data, such as SigClust. We use these general results to provide theoretical guarantees on the misclustering rate of Lloyd's algorithm in a host of applications, including high-dimensional time series, multi-dimensional scaling, and community detection for sparse networks via spectral clustering.
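
As a rough illustration of the setting this abstract describes, the sketch below runs plain Lloyd iterations on samples from a two-component Gaussian mixture with a small additive perturbation, then measures the misclustering rate. It is not the authors' code. The names `lloyd` and `misclustering_rate`, the mixture means, the perturbation scale of 0.1, and the hand-picked initial centers (standing in for "proper initialization") are all assumptions made for illustration.

```python
import numpy as np


def lloyd(points, centers, n_iter=20):
    """Plain Lloyd iterations: assign to nearest center, then recompute means."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers


def misclustering_rate(labels, truth):
    """Two-cluster misclustering rate, minimized over the label swap."""
    err = np.mean(labels != truth)
    return min(err, 1.0 - err)


rng = np.random.default_rng(0)
n = 200
truth = rng.integers(0, 2, size=n)
means = np.array([[-3.0, 0.0], [3.0, 0.0]])  # assumed mixture means

# Unobserved "true" samples: mixture mean plus Gaussian noise.
clean = means[truth] + rng.normal(size=(n, 2))
# Observed samples: a small perturbation relative to the noise (assumed scale).
perturbed = clean + 0.1 * rng.normal(size=(n, 2))

# Assumed reasonable initialization: one center on each side of the origin.
labels, centers = lloyd(perturbed, [[-1.0, 0.0], [1.0, 0.0]])
rate = misclustering_rate(labels, truth)
print(f"misclustering rate on perturbed samples: {rate:.3f}")
```

With well-separated means and a perturbation much smaller than the noise, the rate on the perturbed samples should stay close to the rate on the clean ones. Increasing the perturbation scale or moving the means closer together is a simple way to watch that degrade.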

ML | May 26, 2021
Block Dense Weighted Networks with Augmented Degree Correction

Benjamin Leinwand, Vladas Pipiras

Dense networks with weighted connections often exhibit a community-like structure: although most nodes are connected to each other, different patterns of edge weights may emerge depending on each node's community membership. We propose a new framework for generating and estimating dense weighted networks with potentially different connectivity patterns across different communities. The proposed model relies on a particular class of functions that map individual node characteristics to the edges connecting those nodes, allowing for flexibility while requiring a small number of parameters relative to the number of edges. By leveraging the estimation techniques, we also develop a bootstrap methodology for generating new networks on the same set of vertices, which may be useful in circumstances where multiple data sets cannot be collected. The performance of these methods is analyzed in theory, simulations, and real data.

ME | Jul 9, 2020
Penalized Estimation and Forecasting of Multiple Subject Intensive Longitudinal Data

Zachary F. Fisher, Younghoon Kim, Barbara Fredrickson et al.

Intensive Longitudinal Data (ILD) is increasingly available to social and behavioral scientists. With this increased availability come new opportunities for modeling and predicting complex biological, behavioral, and physiological phenomena. Despite these new opportunities, psychological researchers have not taken full advantage of one promising opportunity inherent to this data: the potential to forecast psychological processes at the individual level. To address this gap in the literature, we present a novel modeling framework that addresses a number of topical challenges and open questions in the psychological literature on modeling dynamic processes. First, how can we model and forecast ILD when the length of individual time series and the number of variables collected are roughly equivalent, or when time series lengths are shorter than what is typically required for time series analyses? Second, how can we best take advantage of the cross-sectional (between-person) information inherent to most ILD scenarios while acknowledging that individuals differ both quantitatively (e.g., in parameter magnitude) and qualitatively (e.g., in structural dynamics)? Despite the acknowledged between-person heterogeneity in many psychological processes, is it possible to leverage group-level information to support improved forecasting at the individual level? In the remainder of the manuscript, we attempt to address these and other pressing questions relevant to the forecasting of multiple-subject ILD.