Ritwik Banerjee

CL
h-index8
5papers
31citations
Novelty44%
AI Score53

5 Papers

14.6CYMay 27
Local Privacy Laws in a Globalized World

Shantanu Sharma, Ethan Myers, Lorenzo De Carli et al.

Personal data has emerged as a highly valuable yet sensitive asset that drives business decisions, enables targeted advertising, and generates substantial revenue for companies, while simultaneously facilitating invasive monitoring of users. In recent years, research on digital privacy violations, including undue access, collection, and sharing of user data, has grown significantly. Much of this research adopts the European General Data Protection Regulation (GDPR) as the primary reference framework. This is reasonable, as GDPR was a pioneering legislation, and many of its stipulations are clear and unambiguous. However, we argue that focusing solely on GDPR (and a small set of other Western regulatory frameworks) ignores privacy-related concerns, attitudes, and problems faced by users from other locales, creating a significant research blind spot. This work systematically normalizes the heterogeneous legal requirements of multiple data protection laws into a unified abstraction aligned with the data lifecycle, which forms the foundation for the implementation of such regulations. We further investigate the implications of these laws on different stakeholders, including users, organizations, and governments. Overall, this work aims to broaden the digital privacy research community's perspective and to serve as a set of guiding principles for developing technological privacy solutions spanning multiple countries.

CLJan 22
A Longitudinal, Multinational, and Multilingual Corpus of News Coverage of the Russo-Ukrainian War

Dikshya Mohanty, Taisiia Sabadyn, Jelwin Rodrigues et al.

We introduce DNIPRO, a novel longitudinal corpus of 246K news articles documenting the Russo-Ukrainian war from Feb 2022 to Aug 2024, spanning eleven media outlets across five nation states (Russia, Ukraine, U.S., U.K., and China) and three languages (English, Russian, and Mandarin Chinese). This multilingual resource features consistent and comprehensive metadata, and multiple types of annotation with rigorous human evaluations for downstream tasks relevant to systematic transnational analyses of contentious wartime discourse. DNIPRO's distinctive value lies in its inclusion of competing geopolitical perspectives, making it uniquely suited for studying narrative divergence, media framing, and information warfare. To demonstrate its utility, we include use case experiments using stance detection, sentiment analysis, topical framing, and contradiction analysis of major conflict events within the larger war. Our explorations reveal how outlets construct competing realities, with coverage exhibiting polarized interpretations that reflect geopolitical interests. Beyond supporting computational journalism research, DNIPRO provides a foundational resource for understanding how conflicting narratives emerge and evolve across global information ecosystems.

SDSep 20, 2025
Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies

Vishnu Raja, Adithya V Ganesan, Anand Syamkumar et al.

State-of-the-art automatic speech recognition (ASR) models like Whisper, perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling strategies that can both generalize and handle idiosyncracy could be more effective for capturing atypical speech. To investigate this, we compare four strategies: (a) $\textit{normative}$ models trained on typical speech (no personalization), (b) $\textit{idiosyncratic}$ models completely personalized to individuals, (c) $\textit{dysarthric-normative}$ models trained on other dysarthric speakers, and (d) $\textit{dysarthric-idiosyncratic}$ models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than idiosyncratic approach while requiring less than half as much personalized data (36.43 WER with 128 train size vs 36.99 with 256). Further, we found that tuning the speech encoder alone (as opposed to the LM decoder) yielded the best results reducing word error rate from 71% to 32% on average. Our findings highlight the value of leveraging both normative (cross-speaker) and idiosyncratic (speaker-specific) patterns to improve ASR for underrepresented speech populations.

CLMay 17, 2025
Class Distillation with Mahalanobis Contrast: An Efficient Training Paradigm for Pragmatic Language Understanding Tasks

Chenlu Wang, Weimin Lyu, Ritwik Banerjee

Detecting deviant language such as sexism, or nuanced language such as metaphors or sarcasm, is crucial for enhancing the safety, clarity, and interpretation of online social discourse. While existing classifiers deliver strong results on these tasks, they often come with significant computational cost and high data demands. In this work, we propose \textbf{Cla}ss \textbf{D}istillation (ClaD), a novel training paradigm that targets the core challenge: distilling a small, well-defined target class from a highly diverse and heterogeneous background. ClaD integrates two key innovations: (i) a loss function informed by the structural properties of class distributions, based on Mahalanobis distance, and (ii) an interpretable decision algorithm optimized for class separation. Across three benchmark detection tasks -- sexism, metaphor, and sarcasm -- ClaD outperforms competitive baselines, and even with smaller language models and orders of magnitude fewer parameters, achieves performance comparable to several large language models (LLMs). These results demonstrate ClaD as an efficient tool for pragmatic language understanding tasks that require gleaning a small target class from a larger heterogeneous background.

CLFeb 15, 2024
Paying Attention to Deflections: Mining Pragmatic Nuances for Whataboutism Detection in Online Discourse

Khiem Phi, Noushin Salek Faramarzi, Chenlu Wang et al.

Whataboutism, a potent tool for disrupting narratives and sowing distrust, remains under-explored in quantitative NLP research. Moreover, past work has not distinguished its use as a strategy for misinformation and propaganda from its use as a tool for pragmatic and semantic framing. We introduce new datasets from Twitter and YouTube, revealing overlaps as well as distinctions between whataboutism, propaganda, and the tu quoque fallacy. Furthermore, drawing on recent work in linguistic semantics, we differentiate the `what about' lexical construct from whataboutism. Our experiments bring to light unique challenges in its accurate detection, prompting the introduction of a novel method using attention weights for negative sample mining. We report significant improvements of 4% and 10% over previous state-of-the-art methods in our Twitter and YouTube collections, respectively.