Rumman Chowdhury

h-index: 7

Papers: 7

Citations: 157

Novelty: 32%

AI Score: 48

Ranked #28,694 of 194,257 authors (top 15%) · #28 in CY (top 3%)

7 Papers

9.2 · CY · Nov 15, 2023

Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework

Markus Anderljung, Everett Thornton Smith, Joe O'Brien et al.

With the increasing integration of frontier large language models (LLMs) into society and the economy, decisions related to their training, deployment, and use have far-reaching implications. These decisions should not be left solely in the hands of frontier LLM developers. LLM users, civil society and policymakers need trustworthy sources of information to steer such decisions for the better. Involving outside actors in the evaluation of these systems - what we term 'external scrutiny' - via red-teaming, auditing, and external researcher access, offers a solution. Though there are encouraging signs of increasing external scrutiny of frontier LLMs, its success is not assured. In this paper, we survey six requirements for effective external scrutiny of frontier AI systems and organize them under the ASPIRE framework: Access, Searching attitude, Proportionality to the risks, Independence, Resources, and Expertise. We then illustrate how external scrutiny might function throughout the AI lifecycle and offer recommendations to policymakers.

16.5 · CY · May 29

Next-Billion AI Index: The compass for AI utility and adoption in the global majority

Ambrish Rawat, Jessica He, Subhabrata Majumdar et al.

Generative AI assessments remain dominated by frontier capability benchmarks that often fail to capture whether systems can be sustainably deployed, adapted, and trusted in locally grounded and infrastructure-constrained settings. This paper introduces the Next Billion AI Index (nexbax), which we believe is the first diagnostic framework to treat economic viability, operational deployability, and governance alignment as co-equal determinants of AI utility in next-billion-user contexts. Rather than treating usefulness as a single outcome, nexbax operationalizes the preconditions for useful AI through 10 dimensions organized under three themes: Effective Efficiency, Operational Practicality, and Societal Integrity. These dimensions assess whether systems are economically viable, deployable under infrastructure and workflow constraints, and aligned with local needs, user expectations, and collaborative development practices. We pair the framework with rubrics for weak, moderate, and strong performance, and conduct a formative expert evaluation through eleven semi-structured interviews with founders, developers, product leaders, and technical practitioners building AI systems for next-billion markets. Participants found the index useful for reasoning about adoption trade-offs and effective at capturing factors shaping real-world AI uptake -- particularly cost, usability, reliability, and trust. They also identified the need for contextual explanations, domain-specific evidence, and broader stakeholder validation. Nexbax is therefore proposed not as a universal score of social value, but as a diagnostic for artificial useful intelligence: a way to make visible the technical, economic, and governance properties that make inclusive AI deployment more viable.

23.6 · CY · Feb 27, 2024

On the Societal Impact of Open Foundation Models

Sayash Kapoor, Rishi Bommasani, Kevin Klyman et al.

Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to both their benefits and risks. Open foundation models present significant benefits, with some caveats, that span innovation, competition, the distribution of decision-making power, and transparency. To understand their risks of misuse, we design a risk assessment framework for analyzing their marginal risk. Across several misuse vectors (e.g. cyberattacks, bioweapons), we find that current research is insufficient to effectively characterize the marginal risk of open foundation models relative to pre-existing technologies. The framework helps explain why the marginal risk is low in some cases, clarifies disagreements about misuse risks by revealing that past work has focused on different subsets of the framework with different assumptions, and articulates a way forward for more constructive debate. Overall, our work helps support a more grounded assessment of the societal impact of open foundation models by outlining what research is needed to empirically validate their theoretical benefits and risks.

17.4 · AI · Jun 25, 2025

The Singapore Consensus on Global AI Safety Research Priorities

Yoshua Bengio, Tegan Maharaj, Luke Ong et al. · CMU, Mila

Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. The "2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety" aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control).

4.4 · AI · Feb 27

CIRCLE: A Framework for Evaluating AI from a Real-World Lens

Reva Schwartz, Carina Westling, Morgan Briggs et al.

This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI's materialized outcomes in deployment. While existing frameworks like MLOps focus on system stability and benchmarks measure abstract capabilities, decision-makers outside the AI stack lack systematic evidence about the behavior of AI technologies under real-world user variability and constraints. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This can enable governance based on materialized downstream effects rather than theoretical capabilities.

2.3 · CY · Oct 22, 2025

Ask What Your Country Can Do For You: Towards a Public Red Teaming Model

Wm. Matthew Kennedy, Cigdem Patlak, Jayraj Dave et al.

AI systems have the potential to produce both benefits and harms, but without rigorous and ongoing adversarial evaluation, AI actors will struggle to assess the breadth and magnitude of the AI risk surface. Researchers from the field of systems design have developed several effective sociotechnical AI evaluation and red teaming techniques targeting bias, hate speech, mis/disinformation, and other documented harm classes. However, as increasingly sophisticated AI systems are released into high-stakes sectors (such as education, healthcare, and intelligence-gathering), our current evaluation and monitoring methods are proving less and less capable of delivering effective oversight. In order to actually deliver responsible AI and to ensure AI's harms are fully understood and its security vulnerabilities mitigated, pioneering new approaches to close this "responsibility gap" are now more urgent than ever. In this paper, we propose one such approach, the cooperative public AI red-teaming exercise, and discuss early results of its prior pilot implementations. This approach is intertwined with CAMLIS itself: the first in-person public demonstrator exercise was held in conjunction with CAMLIS 2024. We review the operational design and results of this exercise, the prior National Institute of Standards and Technology (NIST)'s Assessing the Risks and Impacts of AI (ARIA) pilot exercise, and another similar exercise conducted with the Singapore Infocomm Media Development Authority (IMDA). Ultimately, we argue that this approach is both capable of delivering meaningful results and is also scalable to many AI developing jurisdictions.

9.7 · CY · Feb 3, 2022

Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics

Tomo Lazovich, Luca Belli, Aaron Gonzales et al.

The harmful impacts of algorithmic decision systems have recently come into focus, with many examples of systems such as machine learning (ML) models amplifying existing societal biases. Most metrics attempting to quantify disparities resulting from ML algorithms focus on differences between groups, dividing users based on demographic identities and comparing model performance or overall outcomes between these groups. However, in industry settings, such information is often not available, and inferring these characteristics carries its own risks and biases. Moreover, typical metrics that focus on a single classifier's output ignore the complex network of systems that produce outcomes in real-world settings. In this paper, we evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in a production recommendation system, the Twitter algorithmic timeline. We define desirable criteria for metrics to be used in an operational setting, specifically by ML practitioners. We characterize different types of engagement with content on Twitter using these metrics, and use these results to evaluate the metrics with respect to the desired criteria. We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users. Overall, we conclude that these metrics can be useful tools for understanding disparate outcomes in online social networks.
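As a rough illustration of what a distributional inequality metric looks like in practice, the sketch below computes two measures borrowed from economics over a list of per-user exposure counts. The abstract does not name the specific metrics evaluated, so the choice of the Gini coefficient and a top-share measure is an assumption, as are the function names and the made-up impression counts. Note that neither measure needs demographic labels, which is the property the abstract highlights for industry settings.

```python
import math
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values.

    0 means every user receives the same amount; values approaching 1
    mean almost everything is concentrated on a few users.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        return 0.0
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    total = sum(xs)
    if total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_share(values: Sequence[float], fraction: float) -> float:
    """Share of the total held by the top `fraction` of users (0 < fraction <= 1)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    xs = sorted(values, reverse=True)
    total = sum(xs)
    if not xs or total == 0:
        return 0.0
    k = max(1, math.ceil(fraction * len(xs)))
    return sum(xs[:k]) / total


if __name__ == "__main__":
    # Hypothetical impressions per author; not data from the paper.
    impressions = [0, 2, 3, 5, 10, 40, 140]
    print(f"Gini: {gini(impressions):.3f}")
    print(f"Top 10% share: {top_share(impressions, 0.10):.3f}")
```

Computing the same metric separately for each candidate algorithm's exposure distribution gives one way to compare how strongly each contributes to skewed outcomes, in the spirit of the comparison the abstract describes.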