Abhishek Roy

ML
h-index33
22papers
402citations
Novelty42%
AI Score47

22 Papers

74.2HCJun 4
Deterring Searches for Child Sexual Abuse Material on Google Search and Promoting Help-Seeking

Rebecca Umbach, Griffin Hunt, John Buckley et al.

Google Search deploys a "Onebox" feature at the top of the results page when users conduct searches for Child Sexual Abuse Material. This study evaluates the impact of a strategic shift in this feature, comparing a revised intervention, focused on repercussions and therapeutic resources, to a previous iteration that focused on reporting. Using a difference-in-differences analysis of internal Google Search logs data, we found the new messaging resulted in a 3.8 percentage point reduction as compared to the status quo in subsequent CSAM-related queries within the same Search session. We found an average click through rate of 0.73% on any of the hyperlinked buttons to help-providing resources. Together, this research presents convergent evidence that a subset of individuals can be deterred from ongoing CSAM-seeking and redirected to therapeutic services.

MLApr 27, 2023
Fairness Uncertainty Quantification: How certain are you that the model is fair?

Abhishek Roy, Prasant Mohapatra

Fairness-aware machine learning has garnered significant attention in recent years because of extensive use of machine learning in sensitive applications like judiciary systems. Various heuristics, and optimization frameworks have been proposed to enforce fairness in classification \cite{del2020review} where the later approaches either provides empirical results or provides fairness guarantee for the exact minimizer of the objective function \cite{celis2019classification}. In modern machine learning, Stochastic Gradient Descent (SGD) type algorithms are almost always used as training algorithms implying that the learned model, and consequently, its fairness properties are random. Hence, especially for crucial applications, it is imperative to construct Confidence Interval (CI) for the fairness of the learned model. In this work we provide CI for test unfairness when a group-fairness-aware, specifically, Disparate Impact (DI), and Disparate Mistreatment (DM) aware linear binary classifier is trained using online SGD-type algorithms. We show that asymptotically a Central Limit Theorem holds for the estimated model parameter of both DI and DM-aware models. We provide online multiplier bootstrap method to estimate the asymptotic covariance to construct online CI. To do so, we extend the known theoretical guarantees shown on the consistency of the online bootstrap method for unconstrained SGD to constrained optimization which could be of independent interest. We illustrate our results on synthetic and real datasets.

OCJun 22, 2022
Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data

Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi

We study stochastic optimization algorithms for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we study both projection-based and projection-free algorithms. In both cases, we establish that the number of calls to the stochastic first-order oracle to obtain an appropriately defined $ε$-stationary point is of the order $\mathcal{O}(1/ε^{2.5})$. In the projection-free setting we additionally establish that the number of calls to the linear minimization oracle is of order $\mathcal{O}(1/ε^{5.5})$. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.

43.1HCMar 20
"It didn't feel right but I needed a job so desperately": Understanding People's Emotions & Help Needs During Financial Scams

Jake Chanenson, Tara Matthews, Sunny Consolvo et al.

Online financial scams represent a long-standing and serious threat for which people seek help. We present a study to understand people's in situ motivations for engaging with scams and the help needs they express before, during, and after encountering a scam. We identify the main emotions scammers exploited (e.g., fear, hope) and characterize how they did so. We examine factors -- such as financial insecurity and legal precarity -- which elevate people's risk of engaging with specific scams and experiencing harm. We indicate when people sought help and describe their help-seeking needs and emotions at different stages of the scam. We discuss how these needs could be met through the design of contextually-specific prevention, diagnostic, mitigation, and recovery interventions.

STAug 3, 2023
Online covariance estimation for stochastic gradient descent under Markovian sampling

Abhishek Roy, Krishnakumar Balasubramanian

We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. Convergence rates of order $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ are established under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ denotes observations or SGD iterations. These rates match the best-known convergence rate for independent and identically distributed (i.i.d) data. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. Numerical illustrations provide confidence intervals for SGD in linear and logistic regression models under Markovian sampling. Additionally, our method is applied to the strategic classification with logistic regression, where adversaries adaptively modify features during training to affect target class classification.

CVNov 26, 2022
Sketch2FullStack: Generating Skeleton Code of Full Stack Website and Application from Sketch using Deep Learning and Computer Vision

Somoy Subandhu Barua, Imam Mohammad Zulkarnain, Abhishek Roy et al.

For a full-stack web or app development, it requires a software firm or more specifically a team of experienced developers to contribute a large portion of their time and resources to design the website and then convert it to code. As a result, the efficiency of the development team is significantly reduced when it comes to converting UI wireframes and database schemas into an actual working system. It would save valuable resources and fasten the overall workflow if the clients or developers can automate this process of converting the pre-made full-stack website design to get a partially working if not fully working code. In this paper, we present a novel approach of generating the skeleton code from sketched images using Deep Learning and Computer Vision approaches. The dataset for training are first-hand sketched images of low fidelity wireframes, database schemas and class diagrams. The approach consists of three parts. First, the front-end or UI elements detection and extraction from custom-made UI wireframes. Second, individual database table creation from schema designs and lastly, creating a class file from class diagrams.

OCAug 4, 2023
Optimization on Pareto sets: On a theory of multi-objective optimization

Abhishek Roy, Geelon So, Yi-An Ma

In multi-objective optimization, a single decision vector must balance the trade-offs between many objectives. Solutions achieving an optimal trade-off are said to be Pareto optimal: these are decision vectors for which improving any one objective must come at a cost to another. But as the set of Pareto optimal vectors can be very large, we further consider a more practically significant Pareto-constrained optimization problem, where the goal is to optimize a preference function constrained to the Pareto set. We investigate local methods for solving this constrained optimization problem, which poses significant challenges because the constraint set is (i) implicitly defined, and (ii) generally non-convex and non-smooth, even when the objectives are. We define notions of optimality and stationarity, and provide an algorithm with a last-iterate convergence rate of $O(K^{-1/2})$ to stationarity when the objectives are strongly convex and Lipschitz smooth.

CLMar 25, 2022
Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach

Satyajit Ghosh, Aniruddha Ghosh, Bittaswer Ghosh et al.

Plagiarism means taking another person's work and not giving any credit to them for it. Plagiarism is one of the most serious problems in academia and among researchers. Even though there are multiple tools available to detect plagiarism in a document but most of them are domain-specific and designed to work in English texts, but plagiarism is not limited to a single language only. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali Literature has a history of 1300 years. Hence most Bengali Literature books are not yet digitalized properly. As there was no such corpus present for our purpose so we have collected Bengali Literature books from the National Digital Library of India and with a comprehensive methodology extracted texts from it and constructed our corpus. Our experimental results find out average accuracy between 72.10 % - 79.89 % in text extraction using OCR. Levenshtein Distance algorithm is used for determining Plagiarism. We have built a web application for end-user and successfully tested it for Plagiarism detection in Bengali texts. In future, we aim to construct a corpus with more books for more accurate detection.

70.7AIMar 26
Evaluating Language Models for Harmful Manipulation

Canfer Akbulut, Rasmi Elasmar, Abhishek Roy et al.

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

MLFeb 7, 2025
Online Covariance Estimation in Nonsmooth Stochastic Approximation

Liwei Jiang, Abhishek Roy, Krishna Balasubramanian et al.

We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al.(2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.

NIFeb 4, 2022
Predictive Closed-Loop Service Automation in O-RAN based Network Slicing

Joseph Thaliath, Solmaz Niknam, Sukhdeep Singh et al.

Network slicing provides introduces customized and agile network deployment for managing different service types for various verticals under the same infrastructure. To cater to the dynamic service requirements of these verticals and meet the required quality-of-service (QoS) mentioned in the service-level agreement (SLA), network slices need to be isolated through dedicated elements and resources. Additionally, allocated resources to these slices need to be continuously monitored and intelligently managed. This enables immediate detection and correction of any SLA violation to support automated service assurance in a closed-loop fashion. By reducing human intervention, intelligent and closed-loop resource management reduces the cost of offering flexible services. Resource management in a network shared among verticals (potentially administered by different providers), would be further facilitated through open and standardized interfaces. Open radio access network (O-RAN) is perhaps the most promising RAN architecture that inherits all the aforementioned features, namely intelligence, open and standard interfaces, and closed control loop. Inspired by this, in this article we provide a closed-loop and intelligent resource provisioning scheme for O-RAN slicing to prevent SLA violations. In order to maintain realism, a real-world dataset of a large operator is used to train a learning solution for optimizing resource utilization in the proposed closed-loop service automation process. Moreover, the deployment architecture and the corresponding flow that are cognizant of the O-RAN requirements are also discussed.

DBDec 16, 2021
Predictive Price-Performance Optimization for Serverless Query Processing

Rathijit Sen, Abhishek Roy, Alekh Jindal

We present an efficient, parametric modeling framework for predictive resource allocations, focusing on the amount of computational resources, that can optimize for a range of price-performance objectives for data analytics in serverless query processing settings. We discuss and evaluate in depth how our system, AutoExecutor, can use this framework to automatically select near-optimal executor and core counts for Spark SQL queries running on Azure Synapse. Our techniques improve upon Spark's in-built, reactive, dynamic executor allocation capabilities by substantially reducing the total executors allocated and executor occupancy while running queries, thereby freeing up executors that can potentially be used by other concurrent queries or in reducing the overall cluster provisioning needs. In contrast with post-execution analysis tools such as Sparklens, we predict resource allocations for queries before executing them and can also account for changes in input data sizes for predicting the desired allocations.

DBOct 5, 2021
Phoebe: A Learning-based Checkpoint Optimizer

Yiwen Zhu, Matteo Interlandi, Abhishek Roy et al.

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems, hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates being examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans, and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimum performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.

STSep 6, 2021
On Empirical Risk Minimization with Dependent and Heavy-Tailed Data

Abhishek Roy, Krishnakumar Balasubramanian, Murat A. Erdogdu

In this work, we establish risk bounds for the Empirical Risk Minimization (ERM) with both dependent and heavy-tailed data-generating processes. We do so by extending the seminal works of Mendelson [Men15, Men18] on the analysis of ERM with heavy-tailed but independent and identically distributed observations, to the strictly stationary exponentially $β$-mixing case. Our analysis is based on explicitly controlling the multiplier process arising from the interaction between the noise and the function evaluations on inputs. It allows for the interaction to be even polynomially heavy-tailed, which covers a significantly large class of heavy-tailed models beyond what is analyzed in the learning theory literature. We illustrate our results by deriving rates of convergence for the high-dimensional linear regression problem with dependent and heavy-tailed data.

MLSep 28, 2020
Escaping Saddle-Points Faster under Interpolation-like Conditions

Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi et al.

In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $ε$-local-minimizer, matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/ε^{2})$. We next analyze Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an $ε$-local-minimizer under interpolation-like conditions, is $\tilde{\mathcal{O}}(1/ε^{2.5})$. While this obtained complexity is better than the corresponding complexity of either PSGD, or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{\mathcal{O}}(1/ε^{1.5})$ corresponding to deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.

SPMay 17, 2020
Intelligent O-RAN for Beyond 5G and 6G Wireless Networks

Solmaz Niknam, Abhishek Roy, Harpreet S. Dhillon et al.

Building on the principles of openness and intelligence, there has been a concerted global effort from the operators towards enhancing the radio access network (RAN) architecture. The objective is to build an operator-defined RAN architecture (and associated interfaces) on open hardware that provides intelligent radio control for beyond fifth generation (5G) as well as future sixth generation (6G) wireless networks. Specifically, the open-radio access network (O-RAN) alliance has been formed by merging xRAN forum and C-RAN alliance to formally define the requirements that would help achieve this objective. Owing to the importance of O-RAN in the current wireless landscape, this article provides an introduction to the concepts, principles, and requirements of the Open RAN as specified by the O-RAN alliance. In order to illustrate the role of intelligence in O-RAN, we propose an intelligent radio resource management scheme to handle traffic congestion and demonstrate its efficacy on a real-world dataset obtained from a large operator. A high-level architecture of this deployment scenario that is compliant with the O-RAN requirements is also discussed. The article concludes with key technical challenges and open problems for future research and development.

OCDec 3, 2019
Online and Bandit Algorithms for Nonstationary Stochastic Saddle-Point Optimization

Abhishek Roy, Yifang Chen, Krishnakumar Balasubramanian et al.

Saddle-point optimization problems are an important class of optimization problems with applications to game theory, multi-agent reinforcement learning and machine learning. A majority of the rich literature available for saddle-point optimization has focused on the offline setting. In this paper, we study nonstationary versions of stochastic, smooth, strongly-convex and strongly-concave saddle-point optimization problem, in both online (or first-order) and multi-point bandit (or zeroth-order) settings. We first propose natural notions of regret for such nonstationary saddle-point optimization problems. We then analyze extragradient and Frank-Wolfe algorithms, for the unconstrained and constrained settings respectively, for the above class of nonstationary saddle-point optimization problems. We establish sub-linear regret bounds on the proposed notions of regret in both the online and bandit setting.

LGNov 16, 2019
Suspicion-Free Adversarial Attacks on Clustering Algorithms

Anshuman Chhabra, Abhishek Roy, Prasant Mohapatra

Clustering algorithms are used in a large number of applications and play an important role in modern machine learning-- yet, adversarial attacks on clustering algorithms seem to be broadly overlooked unlike supervised learning. In this paper, we seek to bridge this gap by proposing a black-box adversarial attack for clustering models for linearly separable clusters. Our attack works by perturbing a single sample close to the decision boundary, which leads to the misclustering of multiple unperturbed samples, named spill-over adversarial samples. We theoretically show the existence of such adversarial samples for the K-Means clustering. Our attack is especially strong as (1) we ensure the perturbed sample is not an outlier, hence not detectable, and (2) the exact metric used for clustering is not known to the attacker. We theoretically justify that the attack can indeed be successful without the knowledge of the true metric. We conclude by providing empirical results on a number of datasets, and clustering algorithms. To the best of our knowledge, this is the first work that generates spill-over adversarial samples without the knowledge of the true metric ensuring that the perturbed sample is not an outlier, and theoretically proves the above.

DBAug 30, 2019
Cloudy with high chance of DBMS: A 10-year prediction for Enterprise-Grade ML

Ashvin Agrawal, Rony Chatterjee, Carlo Curino et al.

Machine learning (ML) has proven itself in high-value web applications such as search ranking and is emerging as a powerful tool in a much broader range of enterprise scenarios including voice recognition and conversational understanding for customer support, autotuning for videoconferencing, intelligent feedback loops in large-scale sysops, manufacturing and autonomous vehicle management, complex financial predictions, just to name a few. Meanwhile, as the value of data is increasingly recognized and monetized, concerns about securing valuable data and risks to individual privacy have been growing. Consequently, rigorous data management has emerged as a key requirement in enterprise settings. How will these trends (ML growing popularity, and stricter data governance) intersect? What are the unmet requirements for applying ML in enterprise settings? What are the technical challenges for the DB community to solve? In this paper, we present our vision of how ML and database systems are likely to come together, and early steps we take towards making this vision a reality.

MLJul 31, 2019
Multi-Point Bandit Algorithms for Nonstationary Online Nonconvex Optimization

Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi et al.

Bandit algorithms have been predominantly analyzed in the convex setting with function-value based stationary regret as the performance measure. In this paper, motivated by online reinforcement learning problems, we propose and analyze bandit algorithms for both general and structured nonconvex problems with nonstationary (or dynamic) regret as the performance measure, in both stochastic and non-stochastic settings. First, for general nonconvex functions, we consider nonstationary versions of first-order and second-order stationary solutions as a regret measure, motivated by similar performance measures for offline nonconvex optimization. In the case of second-order stationary solution based regret, we propose and analyze online and bandit versions of the cubic regularized Newton's method. The bandit version is based on estimating the Hessian matrices in the bandit setting, based on second-order Gaussian Stein's identity. Our nonstationary regret bounds in terms of second-order stationary solutions have interesting consequences for avoiding saddle points in the bandit setting. Next, for weakly quasi convex functions and monotone weakly submodular functions we consider nonstationary regret measures in terms of function-values; such structured classes of nonconvex functions enable one to consider regret measure defined in terms of function values, similar to convex functions. For this case of function-value, and first-order stationary solution based regret measures, we provide regret bounds in both the low- and high-dimensional settings, for some scenarios.

STFeb 4, 2019
Stochastic Zeroth-order Discretizations of Langevin Diffusions for Bayesian Inference

Abhishek Roy, Lingqing Shen, Krishnakumar Balasubramanian et al.

Discretizations of Langevin diffusions provide a powerful method for sampling and Bayesian inference. However, such discretizations require evaluation of the gradient of the potential function. In several real-world scenarios, obtaining gradient evaluations might either be computationally expensive, or simply impossible. In this work, we propose and analyze stochastic zeroth-order sampling algorithms for discretizing overdamped and underdamped Langevin diffusions. Our approach is based on estimating the gradients, based on Gaussian Stein's identities, widely used in the stochastic optimization literature. We provide a comprehensive sample complexity analysis -- number noisy function evaluations to be made to obtain an $ε$-approximate sample in Wasserstein distance -- of stochastic zeroth-order discretizations of both overdamped and underdamped Langevin diffusions, under various noise models. We also propose a variable selection technique based on zeroth-order gradient estimates and establish its theoretical guarantees. Our theoretical contributions extend the practical applicability of sampling algorithms to the noisy black-box and high-dimensional settings.

LGJan 28, 2019
Strong Black-box Adversarial Attacks on Unsupervised Machine Learning Models

Anshuman Chhabra, Abhishek Roy, Prasant Mohapatra

Machine Learning (ML) and Deep Learning (DL) models have achieved state-of-the-art performance on multiple learning tasks, from vision to natural language modelling. With the growing adoption of ML and DL to many areas of computer science, recent research has also started focusing on the security properties of these models. There has been a lot of work undertaken to understand if (deep) neural network architectures are resilient to black-box adversarial attacks which craft perturbed input samples that fool the classifier without knowing the architecture used. Recent work has also focused on the transferability of adversarial attacks and found that adversarial attacks are generally easily transferable between models, datasets, and techniques. However, such attacks and their analysis have not been covered from the perspective of unsupervised machine learning algorithms. In this paper, we seek to bridge this gap through multiple contributions. We first provide a strong (iterative) black-box adversarial attack that can craft adversarial samples which will be incorrectly clustered irrespective of the choice of clustering algorithm. We choose 4 prominent clustering algorithms, and a real-world dataset to show the working of the proposed adversarial algorithm. Using these clustering algorithms we also carry out a simple study of cross-technique adversarial attack transferability.