Terrance D. Savitsky

h-index10

2papers

600citations

2 Papers

1.2MSApr 12, 2023

R-Shiny Applications for Local Clustering to be Included in the growclusters for R Package

Randall Powers, Wendy Martinez, Terrance Savitsky

growclusters for R is a package that estimates a partition structure for multivariate data. It does this by implementing a hierarchical version of k-means clustering that accounts for possible known dependencies in a collection of datasets, where each set draws its cluster means from a single, global partition. Each component data set in the collection corresponds to a known group in the data. This paper focuses on R Shiny applications that implement the clustering methodology and simulate data sets with known group structures. These Shiny applications implement novel ways of visualizing the results of the clustering. These visualizations include scatterplots of individual data sets in the context of the entire collection and cluster distributions versus component (or sub-domain) datasets. Data obtained from a collection of 2000-2013 articles from the Bureau of Labor Statistics (BLS) Monthly Labor Review (MLR) will be used to illustrate the R-Shiny applications. Here, the known grouping in the collection is the year of publication.

4.5MLMar 27, 2025

Bayesian Pseudo Posterior Mechanism for Differentially Private Machine Learning

Robert Chew, Matthew R. Williams, Elan A. Segarra et al.

Differential privacy (DP) is becoming increasingly important for deployed machine learning applications because it provides strong guarantees for protecting the privacy of individuals whose data is used to train models. However, DP mechanisms commonly used in machine learning tend to struggle on many real world distributions, including highly imbalanced or small labeled training sets. In this work, we propose a new scalable DP mechanism for deep learning models, SWAG-PPM, by using a pseudo posterior distribution that downweights by-record likelihood contributions proportionally to their disclosure risks as the randomized mechanism. As a motivating example from official statistics, we demonstrate SWAG-PPM on a workplace injury text classification task using a highly imbalanced public dataset published by the U.S. Occupational Safety and Health Administration (OSHA). We find that SWAG-PPM exhibits only modest utility degradation against a non-private comparator while greatly outperforming the industry standard DP-SGD for a similar privacy budget.