Anwitaman Datta

h-index: 33

15 papers

102 citations

Novelty: 35%

AI Score: 37

Ranked #94,426 of 194,257 authors (top 49%); #2,317 in CR (top 34%)

15 Papers

1.7 CL Aug 25, 2023

Misinformation Concierge: A Proof-of-Concept with Curated Twitter Dataset on COVID-19 Vaccination

Shakshi Sharma, Anwitaman Datta, Vigneshwaran Shankaran et al.

We demonstrate the Misinformation Concierge, a proof-of-concept that provides actionable intelligence on misinformation prevalent in social media. Specifically, it uses language processing and machine learning tools to identify subtopics of discourse and discern non/misleading posts; presents statistical reports for policy-makers to understand the big picture of prevalent misinformation in a timely manner; and recommends rebuttal messages for specific pieces of misinformation, identified from within the corpus of data - providing means to intervene and counter misinformation promptly. The Misinformation Concierge proof-of-concept using a curated dataset is accessible at: https://demo-frontend-uy34.onrender.com/

3.9 AI Oct 29, 2023

AMIR: Automated MisInformation Rebuttal -- A COVID-19 Vaccination Datasets based Recommendation System

Shakshi Sharma, Anwitaman Datta, Rajesh Sharma

Misinformation has emerged as a major societal threat in recent years; in the context of the COVID-19 pandemic specifically, it has wreaked havoc, for instance, by fuelling vaccine hesitancy. Cost-effective, scalable solutions for combating misinformation are the need of the hour. This work explored how existing information obtained from social media, augmented with more curated fact-checked data repositories, can be harnessed to facilitate automated rebuttal of misinformation at scale. While the ideas herein can be generalized and reapplied in the broader context of misinformation mitigation using a multitude of information sources and catering to the spectrum of social media platforms, this work serves as a proof of concept, and as such, it is confined in its scope to the rebuttal of tweets only, in the specific context of misinformation regarding COVID-19. It leverages two publicly available datasets, viz. FaCov (fact-checked articles) and misleading (social media Twitter) data on COVID-19 Vaccination.

1.2 OH Feb 20

How international are international computing conferences? -- An exploration with systems research conferences

Pedro Garcia Lopez, Marina López Alet, Usama Benabdelkrim Zakan et al.

In recent years, Asia's rapid growth in research output has been reshaping the computing research landscape. What was once a two-block system (America and Europe) is evolving into a multipolar world with three major hubs: America, Europe, and Asia. To study these pivotal changes and evaluate international diversity, we have analyzed the past 13 years of 13 international systems research conferences: ASPLOS, NSDI, OSDI, SIGCOMM, ATC, EuroSys, ICDCS, Middleware, SoCC, CCGRID, IC2E, IEEE Cloud and EuroPar. Our analysis focuses on accepted papers and participation in the Program Committee, grouping the results by region (America, Europe, and Asia). Surprisingly, we find a pronounced historical imbalance in international diversity among top-tier systems conferences (ASPLOS, OSDI, NSDI, SIGCOMM). While most other conferences have progressively reflected Asia's growing research presence over the past decades, this group has shown a noticeable adjustment only in the last four years. We also identify persistent rigidities in how program committee (PC) diversity adapts to shifts in accepted paper origins, with a consistent under-representation of researchers from Asian organizations in many PCs.

2.3 DB May 27, 2013 Code

Streamforce: outsourcing access control enforcement for stream data to the clouds

Tien Tuan Anh Dinh, Anwitaman Datta

As a tremendous amount of data is generated every day from human activity and from devices equipped with sensing capabilities, cloud computing has emerged as a scalable and cost-effective platform to store and manage the data. While the benefits of cloud computing are numerous, security concerns that arise when data and computation are outsourced to a third party still hinder a complete move to the cloud. In this paper, we focus on the problem of data privacy on the cloud, particularly on access controls over stream data. The nature of stream data and the complexity of sharing data make access control a more challenging issue than in traditional archival databases. We present Streamforce - a system allowing data owners to securely outsource their data to the cloud. The owner specifies fine-grained policies which are enforced by the cloud. The latter performs most of the heavy computations, while learning nothing about the data. To this end, we employ a number of encryption schemes, including deterministic encryption, proxy-based attribute-based encryption and sliding-window encryption. In Streamforce, access control policies are modeled as secure continuous queries, which entails minimal changes to existing stream processing engines, and allows for easy expression of a wide range of policies. In particular, Streamforce comes with a number of secure query operators including Map, Filter, Join and Aggregate. Finally, we implement Streamforce over an open source stream processing engine (Esper) and evaluate its performance on a cloud platform. The results demonstrate practical performance for many real-world applications, and although the security overhead is visible, Streamforce is highly scalable.

7.9 LG Jan 4, 2024

eCIL-MU: Embedding based Class Incremental Learning and Machine Unlearning

Zhiwei Zuo, Zhuo Tang, Bin Wang et al.

New categories may be introduced over time, or existing categories may need to be reclassified. Class incremental learning (CIL) is employed for the gradual acquisition of knowledge about new categories while preserving information about previously learned ones in such dynamic environments. It might also be necessary to eliminate the influence of related categories on the model to adapt to reclassification. We thus introduce class-level machine unlearning (MU) within CIL. Typically, MU methods tend to be time-consuming and can potentially harm the model's performance. A continuous stream of unlearning requests could lead to catastrophic forgetting. To address these issues, we propose a non-destructive eCIL-MU framework based on embedding techniques that map data into vectors, which are then stored in vector databases. Our approach exploits the overlap between CIL and MU tasks for acceleration. Experiments demonstrate the capability of achieving unlearning effectiveness and orders of magnitude (up to $\sim 278\times$) of acceleration.

9.2 LG Jan 9, 2024

Machine unlearning through fine-grained model parameters perturbation

Zhiwei Zuo, Zhuo Tang, Kenli Li et al.

Machine unlearning techniques, which involve retracting data records and reducing the influence of said data on trained models, help with the user privacy protection objective but incur significant computational costs. Weight perturbation-based unlearning is a general approach, but it typically involves globally modifying the parameters. We propose fine-grained Top-K and Random-k parameter perturbation strategies for inexact machine unlearning that address the privacy needs while keeping the computational costs tractable. In order to demonstrate the efficacy of our strategies, we also tackle the challenge of evaluating the effectiveness of machine unlearning by considering the model's generalization performance across both unlearning and remaining data. To better assess the unlearning effect and model generalization, we propose novel metrics, namely, the forgetting rate and memory retention rate. However, for inexact machine unlearning, current metrics are inadequate in quantifying the degree of forgetting that occurs after unlearning strategies are applied. To address this, we introduce SPD-GAN, which subtly perturbs the distribution of data targeted for unlearning. Then, we evaluate the degree of unlearning by measuring the performance difference of the models on the perturbed unlearning data before and after the unlearning process. By implementing these innovative techniques and metrics, we achieve computationally efficacious privacy protection in machine learning applications without significant sacrifice of model performance. Furthermore, this approach provides a novel method for evaluating the degree of unlearning.
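As a rough illustration of the contrast the abstract draws between global and fine-grained weight perturbation, the sketch below adds Gaussian noise to only k selected parameters. Everything here is an assumption made for illustration: the function names, the `scores` input (a hypothetical per-parameter relevance score, for example a gradient magnitude on the data to forget), and the noise model are not taken from the paper.

```python
import random

def perturb_top_k(params, scores, k, noise_std=0.1, seed=0):
    """Add Gaussian noise to the k parameters with the highest scores.

    `scores` is a hypothetical per-parameter relevance score; how the
    paper actually ranks parameters is not stated in the abstract.
    """
    rng = random.Random(seed)
    # Indices of the k highest-scoring parameters.
    top = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)[:k]
    perturbed = list(params)
    for i in top:
        perturbed[i] += rng.gauss(0.0, noise_std)
    return perturbed, sorted(top)

def perturb_random_k(params, k, noise_std=0.1, seed=0):
    """Add Gaussian noise to k parameters chosen uniformly at random."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(params)), k)
    perturbed = list(params)
    for i in chosen:
        perturbed[i] += rng.gauss(0.0, noise_std)
    return perturbed, sorted(chosen)

params = [0.5, -1.2, 0.3, 2.0]
scores = [0.1, 0.9, 0.2, 0.8]
top_perturbed, top_idx = perturb_top_k(params, scores, k=2)
rand_perturbed, rand_idx = perturb_random_k(params, k=2)
```

In both variants only k entries are touched, which is where the cost saving over a global perturbation would come from under these assumptions.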

1.2 SI Nov 23, 2021

A Modular Framework for Centrality and Clustering in Complex Networks

Frederique Oggier, Silivanxay Phetsouvanh, Anwitaman Datta

The structure of many complex networks includes edge directionality and weights on top of their topology. Network analysis techniques that can seamlessly consider combinations of these properties are desirable. In this paper, we study two such important network analysis techniques, namely, centrality and clustering. An information-flow based model is adopted for clustering, which itself builds upon an information theoretic measure for computing centrality. Our principal contributions include a generalized model of Markov entropic centrality with the flexibility to tune the importance of node degrees, edge weights and directions, with a closed-form asymptotic analysis. It leads to a novel two-stage graph clustering algorithm. The centrality analysis helps reason about the suitability of our approach to cluster a given graph, and determine 'query' nodes, around which to explore local community structures, leading to an agglomerative clustering mechanism. The entropic centrality computations are amortized by our clustering algorithm, making it computationally efficient: compared to prior approaches using Markov entropic centrality for clustering, our experiments demonstrate multiple orders of magnitude of speed-up. Our clustering algorithm naturally inherits the flexibility to accommodate edge directionality, as well as different interpretations and interplay between edge weights and node degrees. Overall, this paper thus not only makes significant theoretical and conceptual contributions, but also translates the findings into artifacts of practical relevance, yielding new, effective and scalable centrality computations and graph clustering algorithms, whose efficacy has been validated through extensive benchmarking experiments.

1.2 CL Aug 16, 2021 Code

Misleading the Covid-19 vaccination discourse on Twitter: An exploratory study of infodemic around the pandemic

Shakshi Sharma, Rajesh Sharma, Anwitaman Datta

In this work, we collect a moderate-sized representative corpus of tweets (200,000 approx.) pertaining to Covid-19 vaccination, spanning a period of seven months (September 2020 - March 2021). Following a Transfer Learning approach, we utilize the pre-trained Transformer-based XLNet model to classify tweets as Misleading or Non-Misleading and validate against a random subset of results manually. We build on this to study and contrast the characteristics of tweets in the corpus that are misleading in nature against non-misleading ones. This exploratory analysis enables us to design features (such as sentiments, hashtags, nouns, pronouns, etc.) that can, in turn, be exploited for classifying tweets as (Non-)Misleading using various ML models in an explainable manner. Specifically, several ML models are employed for prediction, with up to 90% accuracy, and the importance of each feature is explained using the SHAP Explainable AI (XAI) tool. While the thrust of this work is principally exploratory analysis in order to obtain insights on the online discourse on Covid-19 vaccination, we conclude the paper by outlining how these insights provide the foundations for a more actionable approach to mitigate misinformation. The curated dataset and code are made available (Github repository) so that the research community at large can reproduce, compare against, or build upon this work.

3.3 CY May 21, 2019

Blockchain in the Government Technology Fabric

Anwitaman Datta

Fuelled by the success (and hype) around cryptocurrencies, distributed ledger technologies (DLT), particularly blockchains, have gained a lot of attention from a wide audience who perceive blockchains as a key to carrying out business processes that have hitherto been cumbersome in a cost- and time-effective manner. Governments across the globe have responded to this promising but nascent technology differently - from being apathetic or adopting a wait-and-watch approach: letting the systems shape themselves, to creating regulatory sandboxes and sponsoring capacity building, or in some instances (arguably) over-regulating and attempting to put the blockchain genie back in the bottle. Possible government roles span a spectrum: regulating crypto-currencies and initial coin offerings (ICO), formulating regulatory frameworks for managing the adoption of blockchains, particularly in critical infrastructure industries, facilitating capacity building, and finally, embracing blockchain technology in conducting the activities of the government itself - be it internally, or in using them to deliver public services. In this paper we survey the last, namely, the use of blockchain and associated distributed ledger technologies in the government technology (GovTech) stack, and discuss the merits and concerns associated with the existing initiatives and approaches.

2.3 DC Apr 30, 2019

Please, do not decentralize the Internet with (permissionless) blockchains!

Pedro Garcia Lopez, Alberto Montresor, Anwitaman Datta

The old mantra of decentralizing the Internet is coming again with fanfare, this time around the blockchain technology hype. We have already seen a technology supposed to change the nature of the Internet: peer-to-peer. The reality is that peer-to-peer naming systems failed, peer-to-peer social networks failed, and yes, peer-to-peer storage failed as well. In this paper, we will review the research on distributed systems in the last few years to identify the limits of open peer-to-peer networks. We will address issues like system complexity, security and frailty, instability and performance. We will show how many of the aforementioned problems also apply to the recent breed of permissionless blockchain networks. The applicability of such systems to mature industrial applications is undermined by the same properties that make them so interesting for a libertarian audience: namely, their openness, their pseudo-anonymity and their unregulated cryptocurrencies. As such, we argue that permissionless blockchain networks are unsuitable to be the substrate for a decentralized Internet. Yet, there is still hope for more decentralization, albeit in a form somewhat limited with respect to the libertarian view of a decentralized Internet: in cooperation rather than in competition with the superpowerful datacenters that dominate the world today. This is derived from the recent surge in interest in byzantine fault tolerance and permissioned blockchains, which opens the door to a world where use of trusted third parties is not the only way to arbitrate an ensemble of entities. The ability to establish trust through permissioned blockchains makes it possible to move control from the datacenters to the edge, truly realizing the promises of edge-centric computing.

3.2 CR Jul 31, 2015

Auditable Versioned Data Storage Outsourcing

Ertem Esiner, Anwitaman Datta

Auditability is crucial for data outsourcing, facilitating accountability and identifying data loss or corruption incidents in a timely manner, reducing in turn the risks from such losses. In recent years, in sync with the growing trend of outsourcing, a lot of progress has been made in designing probabilistic (for efficiency) provable data possession (PDP) schemes. However, even the recent and advanced PDP solutions that do deal with dynamic data do so in a limited manner, and for only the latest version of the data. A naive solution treating different versions in isolation would work, but leads to tremendous overheads, and is undesirable. In this paper, we present algorithms to achieve full persistence (all intermediate configurations are preserved and are modifiable) for an optimized skip list (known as FlexList) so that versioned data can be audited. The proposed scheme provides deduplication at the level of logical, variable-sized blocks, such that only the altered parts of the different versions are kept, while the persistent data structure facilitates access (read) of any arbitrary version with the same storage and processing efficiency that state-of-the-art dynamic PDP solutions provide for only the current version, while commit (write) operations incur around 5% additional time. Furthermore, the time overhead for auditing arbitrary versions in addition to the latest version is imperceptible even on a low-end server...
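The remark that PDP schemes are probabilistic "for efficiency" can be made concrete with a small calculation. The sketch below assumes a simple spot-check model that is not taken from the paper: the auditor challenges a uniformly random set of distinct blocks, and the audit fails whenever at least one challenged block is corrupted. The function name and parameters are hypothetical.

```python
from math import comb

def detection_probability(n_blocks, n_corrupted, n_challenged):
    """Chance that a random challenge hits at least one corrupted block.

    Assumes n_challenged distinct blocks are drawn uniformly at random
    from n_blocks, of which n_corrupted are damaged (hypergeometric model).
    """
    if n_challenged > n_blocks - n_corrupted:
        # Not enough intact blocks to fill the challenge: detection is certain.
        return 1.0
    miss = comb(n_blocks - n_corrupted, n_challenged) / comb(n_blocks, n_challenged)
    return 1.0 - miss

# Example: 10,000 blocks, 1% corrupted, 460 blocks challenged.
p = detection_probability(10_000, 100, 460)
```

Under this model a challenge covering under 5% of the blocks already detects 1% corruption with probability above 0.98, which is why sampling a few blocks can stand in for checking all of them.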

1.2 SI Mar 4, 2013

The Zen of Multidisciplinary Team Recommendation

Anwitaman Datta, Stefano Braghin, Jackson Tan Teck Yong

In order to accomplish complex tasks, it is often necessary to compose a team consisting of experts with diverse competencies. However, for proper functioning, it is also preferable that a team be socially cohesive. A team recommendation system, which facilitates the search for potential team members, can be of great help both for (i) individuals who need to seek out collaborators and (ii) managers who need to build a team for some specific tasks. A decision support system which readily helps summarize such metrics, and possibly rank the teams in a personalized manner according to the end users' preferences, can be a great tool to navigate what would otherwise be an information avalanche. In this work we present a general framework of how to compose such subsystems together to build a composite team recommendation system, and instantiate it for a case study of academic teams.

3.0 CR Oct 2, 2012

Stream on the Sky: Outsourcing Access Control Enforcement for Stream Data to the Cloud

Tien Tuan Anh Dinh, Anwitaman Datta

There is an increasing trend for businesses to migrate their systems towards the cloud. Security concerns that arise when outsourcing data and computation to the cloud include data confidentiality and privacy. Given that a tremendous amount of data is being generated every day from a plethora of devices equipped with sensing capabilities, we focus on the problem of access controls over live streams of data based on triggers or sliding windows, which is a distinct and more challenging problem than access control over archival data. Specifically, we investigate secure mechanisms for outsourcing access control enforcement for stream data to the cloud. We devise a system that allows data owners to specify fine-grained policies associated with their data streams, then to encrypt the streams and relay them to the cloud for live processing and storage for future use. The access control policies are enforced by the cloud, without the latter learning about the data, while ensuring that unauthorized access is not feasible. To realize these ends, we employ a novel cryptographic primitive, namely proxy-based attribute-based encryption, which not only provides security but also allows the cloud to perform expensive computations on behalf of the users. Our approach is holistic, in that these controls are integrated with an XML-based framework (XACML) for high-level management of policies. Experiments with our prototype demonstrate the feasibility of such mechanisms, and early evaluations suggest graceful scalability with increasing numbers of policies, data streams and users.
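To make the notion of a sliding-window access policy concrete, the sketch below shows a plaintext version only: a consumer sees windowed averages rather than raw readings, and only for readings a policy predicate allows. It involves no encryption and is not the system described above; the function name, the `allowed` predicate, and the reading format are all assumptions for illustration.

```python
from collections import deque

def sliding_window_average(stream, window_size, allowed):
    """Yield the mean of the last `window_size` readings that pass `allowed`.

    `allowed` is a hypothetical policy predicate; readings it rejects
    never enter the window, so they cannot influence any output.
    """
    window = deque(maxlen=window_size)  # oldest value drops out automatically
    for reading in stream:
        if not allowed(reading):
            continue
        window.append(reading["value"])
        if len(window) == window_size:
            yield sum(window) / window_size

readings = [
    {"sensor": "a", "value": 1},
    {"sensor": "b", "value": 100},
    {"sensor": "a", "value": 2},
    {"sensor": "a", "value": 3},
    {"sensor": "a", "value": 4},
]
averages = list(
    sliding_window_average(readings, 2, lambda r: r["sensor"] == "a")
)
```

The coarser the window, the less the consumer learns about individual readings, which is the kind of granularity control a data owner might want to express as a policy.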

3.0 CR Jun 10, 2012

CloudMine: Multi-Party Privacy-Preserving Data Analytics Service

Dinh Tien Tuan Anh, Quach Vinh Thanh, Anwitaman Datta

An increasing number of businesses are replacing their data storage and computation infrastructure with cloud services. Likewise, there is an increased emphasis on performing analytics based on multiple datasets obtained from different data sources. While ensuring security of data and computation outsourced to a third party cloud is in itself challenging, supporting analytics using data distributed across multiple, independent clouds is even further from trivial. In this paper we present CloudMine, a cloud-based service which allows multiple data owners to perform privacy-preserved computation over the joint data using their clouds as delegates. CloudMine protects data privacy with respect to semi-honest data owners and semi-honest clouds. It furthermore ensures the privacy of the computation outputs from the curious clouds. It allows data owners to reliably detect if their cloud delegates have been lazy when carrying out the delegated computation. CloudMine can run as a centralized service on a single cloud, or as a distributed service over multiple, independent clouds. CloudMine supports a set of basic computations that can be used to construct a variety of highly complex, distributed privacy-preserving data analytics. We demonstrate how a simple instance of CloudMine (secure sum service) is used to implement three classical data mining tasks (classification, association rule mining and clustering) in a cloud environment. We experiment with a prototype of the service, the results of which suggest its practicality for supporting privacy-preserving data analytics as a (multi) cloud-based service.
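The abstract names a secure sum service as the simplest CloudMine instance. One textbook way to compute a sum without revealing the inputs is additive secret sharing, sketched below in a single process. This is a generic illustration, not CloudMine's protocol: the modulus, function names, and the simulation of all parties in one program are assumptions.

```python
import secrets

# Hypothetical modulus; it must exceed the largest sum that can occur.
MODULUS = 2**61 - 1

def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it mod `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]

def secure_sum(private_values, modulus=MODULUS):
    """Simulate a secure sum among len(private_values) parties."""
    n = len(private_values)
    # Row i holds the shares made by owner i; column j is what party j receives.
    share_matrix = [make_shares(v, n, modulus) for v in private_values]
    # Each party publishes only the sum of the shares it received.
    partial_sums = [sum(row[j] for row in share_matrix) % modulus for j in range(n)]
    return sum(partial_sums) % modulus

total = secure_sum([12, 30, 7])
```

Each individual share is uniformly random, so a single party's view reveals nothing about another owner's value in this semi-honest setting; only the combined partial sums recover the total.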

5.3 CR May 29, 2012

Cloud and the City: Facilitating Flexible Access Control over Data Streams

Wen Qiang Wang, Dinh Tien Tuan Anh, Hock Beng Lim et al.

The proliferation of sensing devices creates a plethora of data streams, which in turn can be harnessed to carry out sophisticated analytics to support various real-time applications and services as well as long-term planning, e.g., in the context of intelligent cities or smart homes, to name a few prominent ones. A mature cloud infrastructure brings such a vision closer to reality than ever before. However, we believe that the ability for data owners to flexibly and easily control the granularity at which they share their data with other entities is very important - in making data owners feel comfortable to share to start with, and also to leverage such fine-grained control to realize different business models or logics. In this paper, we explore some basic operations to flexibly control the access on a data stream and propose a framework eXACML+ that extends OASIS's XACML model to achieve the same. We develop a prototype using the commercial StreamBase engine to demonstrate a seamless combination of stream data processing with (a small but important selected set of) fine-grained access control mechanisms, and study the framework's efficacy based on experiments in cloud-like environments.