26.0CLApr 16Code
Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set AuthoringSumit Mukherjee, Juan Shu, Nairwita Mazumder et al.
Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC~0.852 and value-set-level F1~0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC~0.799, F1~0.250) and both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4 respectively. Zero-shot GPT-4o achieves value-set-level F1~0.105, with 48.6\% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code is available at: \href{https://github.com/mukhes3/RASC}{https://github.com/mukhes3/RASC}.
LGOct 30, 2023
Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular DataMayana Pereira, Meghana Kshirsagar, Sumit Mukherjee et al.
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic data set generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generator MWEM PGM can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
SYDec 18, 2012
Identification of Nonlinear Systems From the Knowledge Around Different Operating Conditions: A Feed-Forward Multi-Layer ANN Based ApproachSayan Saha, Saptarshi Das, Anish Acharya et al.
The paper investigates nonlinear system identification using system output data at various linearized operating points. A feed-forward multi-layer Artificial Neural Network (ANN) based approach is used for this purpose and tested for two target applications i.e. nuclear reactor power level monitoring and an AC servo position control system. Various configurations of ANN using different activation functions, number of hidden layers and neurons in each layer are trained and tested to find out the best configuration. The training is carried out multiple times to check for consistency and the mean and standard deviation of the root mean square errors (RMSE) are reported for each configuration.
3.3GTApr 21
Geometric Comparisons of Electoral Rules Under FeedbackSumit Mukherjee
We study how electoral rules shape polarization dynamics when voters and candidates both adapt to repeated election outcomes. We introduce two geometric primitives for comparing rules under this feedback: the \emph{winner radius} $R_t = \max_i \|x_i - w^{(t)}\|$, the distance from the winner to the farthest voter, and the \emph{supporter centroid radius} $S_t = \max_j \|c_j - s_j^{(t)}\|$, the largest gap between any candidate and their support base. We show that $R_t$ controls a one-step contraction bound on voter disagreement and $S_t$ plays the analogous role for candidate dispersion, and that these two objectives are in tension. Rules that reduce $R_t$ tend to increase $S_t$, and vice versa. A winner close to the voter median does not resolve the tension, since proximity to the median and proximity to the Chebyshev center are different objectives. We use this framing to organize a simulation study across seven standard electoral rules and one convex-combination benchmark, comprising 1000+ runs across diverse electorate profiles, voter mechanisms, and camp-balance settings. The empirical results confirm the theoretical tradeoff: winner-take-all rules achieve small $S_t$ at the cost of large $R_t$ and weaker voter depolarization, while convex-combination rules reverse this. An oracle comparison further shows that minimizing $R_t$ per step and minimizing voter disagreement per step are distinct objectives with different long-run consequences for both voter and candidate dynamics.
CVOct 15, 2024
Synthesizing Proton-Density Fat Fraction and $R_2^*$ from 2-point Dixon MRI with Generative Machine LearningSuma Anand, Kaiwen Xu, Colm O'Dushlaine et al.
Magnetic Resonance Imaging (MRI) is the gold standard for measuring fat and iron content non-invasively in the body via measures known as Proton Density Fat Fraction (PDFF) and $R_2^*$, respectively. However, conventional PDFF and $R_2^*$ quantification methods operate on MR images voxel-wise and require at least three measurements to estimate three quantities: water, fat, and $R_2^*$. Alternatively, the two-point Dixon MRI protocol is widely used and fast because it acquires only two measurements; however, these cannot be used to estimate three quantities voxel-wise. Leveraging the fact that neighboring voxels have similar values, we propose using a generative machine learning approach to learn PDFF and $R_2^*$ from Dixon MRI. We use paired Dixon-IDEAL data from UK Biobank in the liver and a Pix2Pix conditional GAN to demonstrate the first large-scale $R_2^*$ imputation from two-point Dixon MRIs. Using our proposed approach, we synthesize PDFF and $R_2^*$ maps that show significantly greater correlation with ground-truth than conventional voxel-wise baselines.
MLJun 15, 2021
An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected SurprisesMayana Pereira, Meghana Kshirsagar, Sumit Mukherjee et al.
Diferentially private (DP) synthetic datasets are a powerful approach for training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results show that although there seems to be a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when it is deployed on real data. We hence advocate on the need for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.
CVJun 9, 2021
A machine learning pipeline for aiding school identification from child trafficking imagesSumit Mukherjee, Tina Sederholm, Anthony C. Roman et al.
Child trafficking in a serious problem around the world. Every year there are more than 4 million victims of child trafficking around the world, many of them for the purposes of child sexual exploitation. In collaboration with UK Police and a non-profit focused on child abuse prevention, Global Emancipation Network, we developed a proof-of-concept machine learning pipeline to aid the identification of children from intercepted images. In this work, we focus on images that contain children wearing school uniforms to identify the school of origin. In the absence of a machine learning pipeline, this hugely time consuming and labor intensive task is manually conducted by law enforcement personnel. Thus, by automating aspects of the school identification process, we hope to significantly impact the speed of this portion of child identification. Our proposed pipeline consists of two machine learning models: i) to identify whether an image of a child contains a school uniform in it, and ii) identification of attributes of different school uniform items (such as color/texture of shirts, sweaters, blazers etc.). We describe the data collection, labeling, model development and validation process, along with strategies for efficient searching of schools using the model predictions.
STApr 25, 2021
Variational Inference in high-dimensional linear regressionSumit Mukherjee, Subhabrata Sen
We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of non-linear large deviations (Chatterjee and Dembo,2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite dimensional variational formula for the log normalizing constant of the posterior. Furthermore, we establish that under an additional "separation" condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this "separation" condition. Finally, we illustrate our results on concrete examples with specific design matrices.
MLJan 18, 2021
Reducing bias and increasing utility by federated generative modeling of medical images using a centralized adversaryJean-Francois Rajotte, Sumit Mukherjee, Caleb Robinson et al.
We introduce FELICIA (FEderated LearnIng with a CentralIzed Adversary) a generative mechanism enabling collaborative learning. In particular, we show how a data owner with limited and biased data could benefit from other data owners while keeping data from all the sources private. This is a common scenario in medical image analysis where privacy legislation prevents data from being shared outside local premises. FELICIA works for a large family of Generative Adversarial Networks (GAN) architectures including vanilla and conditional GANs as demonstrated in this work. We show that by using the FELICIA mechanism, a data owner with limited image samples can generate high-quality synthetic images with high utility while neither data owners has to provide access to its data. The sharing happens solely through a central discriminator that has access limited to synthetic data. Here, utility is defined as classification performance on a real test set. We demonstrate these benefits on several realistic healthcare scenarions using benchmark image datasets (MNIST, CIFAR-10) as well as on medical images for the task of skin lesion classification. With multiple experiments, we show that even in the worst cases, combining FELICIA with real data gracefully achieves performance on par with real data while most results significantly improves the utility.
CRSep 11, 2020
MACE: A Flexible Framework for Membership Privacy Estimation in Generative ModelsYixi Xu, Sumit Mukherjee, Xiyang Liu et al.
Generative machine learning models are being increasingly viewed as a way to share sensitive data between institutions. While there has been work on developing differentially private generative modeling approaches, these approaches generally lead to sub-par sample quality, limiting their use in real world applications. Another line of work has focused on developing generative models which lead to higher quality samples but currently lack any formal privacy guarantees. In this work, we propose the first formal framework for membership privacy estimation in generative models. We formulate the membership privacy risk as a statistical divergence between training samples and hold-out samples, and propose sample-based methods to estimate this divergence. Compared to previous works, our framework makes more realistic and flexible assumptions. First, we offer a generalizable metric as an alternative to the accuracy metric especially for imbalanced datasets. Second, we loosen the assumption of having full access to the underlying distribution from previous studies , and propose sample-based estimations with theoretical guarantees. Third, along with the population-level membership privacy risk estimation via the optimal membership advantage, we offer the individual-level estimation via the individual privacy risk. Fourth, our framework allows adversaries to access the trained model via a customized query, while prior works require specific attributes.
LGDec 31, 2019
privGAN: Protecting GANs from membership inference attacks at low costSumit Mukherjee, Yixi Xu, Anusua Trivedi et al.
Generative Adversarial Networks (GANs) have made releasing of synthetic images a viable approach to share data without releasing the original dataset. It has been shown that such synthetic data can be used for a variety of downstream tasks such as training classifiers that would otherwise require the original dataset to be shared. However, recent work has shown that the GAN models and their synthetically generated data can be used to infer the training set membership by an adversary who has access to the entire dataset and some auxiliary information. Current approaches to mitigate this problem (such as DPGAN) lead to dramatically poorer generated sample quality than the original non--private GANs. Here we develop a new GAN architecture (privGAN), where the generator is trained not only to cheat the discriminator but also to defend membership inference attacks. The new mechanism provides protection against this mode of attack while leading to negligible loss in downstream performances. In addition, our algorithm has been shown to explicitly prevent overfitting to the training set, which explains why our protection is so effective. The main contributions of this paper are: i) we propose a novel GAN architecture that can generate synthetic data in a privacy preserving manner without additional hyperparameter tuning and architecture selection, ii) we provide a theoretical understanding of the optimal solution of the privGAN loss function, iii) we demonstrate the effectiveness of our model against several white and black--box attacks on several benchmark datasets, iv) we demonstrate on three common benchmark datasets that synthetic images generated by privGAN lead to negligible loss in downstream performance when compared against non--private GANs.
LGOct 4, 2019
Risks of Using Non-verified Open Data: A case study on using Machine Learning techniques for predicting Pregnancy Outcomes in IndiaAnusua Trivedi, Sumit Mukherjee, Edmund Tse et al.
Artificial intelligence (AI) has evolved considerably in the last few years. While applications of AI is now becoming more common in fields like retail and marketing, application of AI in solving problems related to developing countries is still an emerging topic. Specially, AI applications in resource-poor settings remains relatively nascent. There is a huge scope of AI being used in such settings. For example, researchers have started exploring AI applications to reduce poverty and deliver a broad range of critical public services. However, despite many promising use cases, there are many dataset related challenges that one has to overcome in such projects. These challenges often take the form of missing data, incorrectly collected data and improperly labeled variables, among other factors. As a result, we can often end up using data that is not representative of the problem we are trying to solve. In this case study, we explore the challenges of using such an open dataset from India, to predict an important health outcome. We highlight how the use of AI without proper understanding of reporting metrics can lead to erroneous conclusions.
SYJan 5, 2013
Comparative Studies on Decentralized Multiloop PID Controller Design Using Evolutionary AlgorithmsSayan Saha, Saptarshi Das, Anindya Pakhira et al.
Decentralized PID controllers have been designed in this paper for simultaneous tracking of individual process variables in multivariable systems under step reference input. The controller design framework takes into account the minimization of a weighted sum of Integral of Time multiplied Squared Error (ITSE) and Integral of Squared Controller Output (ISCO) so as to balance the overall tracking errors for the process variables and required variation in the corresponding manipulated variables. Decentralized PID gains are tuned using three popular Evolutionary Algorithms (EAs) viz. Genetic Algorithm (GA), Evolutionary Strategy (ES) and Cultural Algorithm (CA). Credible simulation comparisons have been reported for four benchmark 2x2 multivariable processes.