CY · Jun 9, 2023
Evaluating the Social Impact of Generative AI Systems in Systems and Society
Irene Solaiman, Zeerak Talat, William Agnew et al. · allen-ai, cmu
Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm.
AI · Feb 18
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
Mubashara Akhtar, Anka Reuel, Prajna Soni et al. · meta-ai
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
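As a rough illustration of the saturation idea described above (a benchmark that can no longer separate the best-performing models), the following sketch flags a benchmark when its top scores fall within a small band. The function name `is_saturated`, the `top_k` cutoff, and the `max_gap` threshold are assumptions made for this example; the abstract does not state the study's actual criterion.

```python
def is_saturated(scores, top_k=3, max_gap=1.0):
    """Return True when the top_k scores span at most max_gap points.

    Hypothetical criterion for illustration only; the study's own
    definition of saturation is not given in the abstract.
    """
    if len(scores) < top_k:
        raise ValueError("need at least top_k scores")
    # Keep only the best-performing models.
    top = sorted(scores, reverse=True)[:top_k]
    # A small spread means the benchmark no longer separates them.
    return (top[0] - top[-1]) <= max_gap


if __name__ == "__main__":
    print(is_saturated([98.1, 97.9, 97.6, 80.0]))  # leaders are bunched together
    print(is_saturated([90.0, 75.0, 60.0]))        # leaders are well separated
```

In practice the threshold would probably need to account for the metric's scale and for measurement noise, which this sketch ignores.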
RO · Mar 14 · Code
Multi-Robot Navigation in Social Mini-Games: Definitions, Taxonomy, and Algorithms
Rohan Chandra, Shubham Singh, Wenhao Luo et al.
The "Last Mile Challenge" has long been considered an important, yet unsolved, challenge for autonomous vehicles, public service robots, and delivery robots. A central issue in this challenge is the ability of robots to navigate constrained and cluttered environments that have high agency (e.g., doorways, hallways, corridor intersections), often while competing for space with other robots and humans. We refer to these environments as "Social Mini-Games" (SMGs). Traditional navigation approaches designed for multi-robot navigation (MRN) do not perform well in SMGs, which has led to focused research on dedicated SMG solvers. However, publications on SMG navigation research make different assumptions and have different objective functions (safety versus liveness). These assumptions and objectives are sometimes left implicit or described informally. This makes it difficult to establish appropriate baselines for comparison in research papers, and for practitioners to find the papers relevant to their concrete application. Such ad-hoc representation of the field also presents a barrier to new researchers wanting to start research in this area. SMG navigation research requires its own taxonomy, definitions, and evaluation protocols to guide effective research moving forward. This survey is the first to catalog SMG solvers using a well-defined and unified taxonomy and to classify existing methods accordingly. It also discusses the essential properties of SMG solvers, defines what SMGs are and how they appear in practice, outlines how to evaluate SMG solvers, and highlights the differences between SMG solvers and general navigation systems. The survey concludes with an overview of future directions and open challenges in the field. Our project is open-sourced at https://socialminigames.github.io/.
CR · Sep 7, 2024
Towards identifying Source credibility on Information Leakage in Digital Gadget Market
Neha Kumaru, Garvit Gupta, Shreyas Mongia et al.
The use of social media to share content is constantly rising. One adverse effect of information sharing on social media is the spread of sensitive information in the public domain. With the digital gadget market becoming highly competitive and ever-evolving, an increasing number of social media posts leak sensitive information about devices. Many web-blogs on the digital gadget market have mushroomed recently, making the problem of information leaks all-pervasive. Credible leaks on the specifics of an upcoming device can cause significant financial damage to the respective organization. Hence, it is crucial to assess the credibility of the platforms that continuously post smartphone or digital gadget leaks. In this work, we analyze the headlines of leak web-blog posts and their corresponding official press releases. We first collect 54,495 leak and press-release headlines for different smartphones. We train our custom NER model to capture the evolving smartphone names with an accuracy of 82.14% on manually annotated results. We further propose a credibility score metric for the web-blog, based on the number of falsified and authentic smartphone leak posts.
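The abstract says only that the credibility score depends on the number of falsified and authentic leak posts. One simple way such a score could be formed is the fraction of a blog's leaks that later matched an official release. The formula and the name `credibility_score` below are assumptions for illustration, not the paper's metric.

```python
def credibility_score(authentic, falsified):
    """Fraction of a blog's leak posts that turned out to be authentic.

    Hypothetical formula: the abstract does not give the actual metric.
    Returns None when the blog has no verified posts to score.
    """
    if authentic < 0 or falsified < 0:
        raise ValueError("counts must be non-negative")
    total = authentic + falsified
    if total == 0:
        return None
    return authentic / total


if __name__ == "__main__":
    # A blog with 8 confirmed leaks and 2 falsified ones.
    print(credibility_score(8, 2))
```

A plain ratio treats a blog with one confirmed post the same as one with a hundred; a smoothed estimate would be a natural refinement if post counts vary widely.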
CY · Jul 7, 2023
AI and the EU Digital Markets Act: Addressing the Risks of Bigness in Generative AI
Ayse Gizem Yasar, Andrew Chong, Evan Dong et al.
As AI technology advances rapidly, concerns over the risks of bigness in digital markets are also growing. The EU's Digital Markets Act (DMA) aims to address these risks. Still, the current framework may not adequately cover generative AI systems that could become gateways for AI-based services. This paper argues for integrating certain AI software as core platform services and classifying certain developers as gatekeepers under the DMA. We also propose an assessment of gatekeeper obligations to ensure they cover generative AI services. As the EU considers generative AI-specific rules and possible DMA amendments, this paper provides insights towards diversity and openness in generative AI services.
RO · May 11, 2021 · Code
Efficient Analytical Derivatives of Rigid-Body Dynamics using Spatial Vector Algebra
Shubham Singh, Ryan P. Russell, Patrick M. Wensing
An essential need for many model-based robot control algorithms is the ability to quickly and accurately compute partial derivatives of the equations of motion. State of the art approaches to this problem often use analytical methods based on the chain rule applied to existing dynamics algorithms. Although these methods are an improvement over finite differences in terms of accuracy, they are not always the most efficient. In this paper, we contribute new closed-form expressions for the first-order partial derivatives of inverse dynamics, leading to a recursive algorithm. The algorithm is benchmarked against chain-rule approaches in Fortran and against an existing algorithm from the Pinocchio library in C++. Tests consider computing the partial derivatives of inverse and forward dynamics for robots ranging from kinematic chains to humanoids and quadrupeds. Compared to the previous open-source Pinocchio implementation, our new analytical results uncover a key computational restructuring that enables efficiency gains. Speedups of up to 1.4x are reported for calculating the partial derivatives of inverse dynamics for the 50-dof Talos humanoid.
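To make the contrast between analytical derivatives and finite differences concrete, here is a toy sketch on a single point-mass pendulum, whose inverse dynamics are tau = m l^2 qdd + m g l sin(q). This is not the paper's recursive algorithm and uses no spatial vector algebra; the model, function names, and parameter defaults are assumptions chosen to keep the example self-contained.

```python
import math


def inverse_dynamics(q, qdd, m=1.0, l=1.0, g=9.81):
    """Joint torque for a point-mass pendulum (toy model, not a robot)."""
    return m * l**2 * qdd + m * g * l * math.sin(q)


def dtau_dq_analytical(q, m=1.0, l=1.0, g=9.81):
    """Closed-form partial derivative of the torque with respect to q."""
    return m * g * l * math.cos(q)


def dtau_dq_finite_difference(q, qdd, h=1e-6, **params):
    """Central-difference approximation of the same partial derivative."""
    plus = inverse_dynamics(q + h, qdd, **params)
    minus = inverse_dynamics(q - h, qdd, **params)
    return (plus - minus) / (2.0 * h)


if __name__ == "__main__":
    q, qdd = 0.3, 0.5
    exact = dtau_dq_analytical(q)
    approx = dtau_dq_finite_difference(q, qdd)
    print(exact, approx, abs(exact - approx))
```

The finite-difference version needs two extra dynamics evaluations per input and carries truncation and round-off error, which is the accuracy gap the abstract refers to.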
LG · Aug 4, 2024
KAN based Autoencoders for Factor Models
Tianqi Wang, Shubham Singh
Inspired by recent advances in Kolmogorov-Arnold Networks (KANs), we introduce a novel approach to latent factor conditional asset pricing models. While previous machine learning applications in asset pricing have predominantly used Multilayer Perceptrons with ReLU activation functions to model latent factor exposures, our method introduces a KAN-based autoencoder which surpasses MLP models in both accuracy and interpretability. Our model offers enhanced flexibility in approximating exposures as nonlinear functions of asset characteristics, while simultaneously providing users with an intuitive framework for interpreting latent factors. Empirical backtesting demonstrates our model's superior ability to explain cross-sectional risk exposures. Moreover, long-short portfolios constructed using our model's predictions achieve higher Sharpe ratios, highlighting its practical value in investment management.
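For readers unfamiliar with the evaluation mentioned above, the sketch below computes the Sharpe ratio of a long-short portfolio from per-period returns, using the standard definition (mean excess return divided by its sample standard deviation, optionally annualized). The helper names and the sample numbers are illustrative assumptions and do not come from the paper.

```python
import math
import statistics


def long_short_returns(long_leg, short_leg):
    """Per-period return of holding the long leg and shorting the short leg."""
    if len(long_leg) != len(short_leg):
        raise ValueError("legs must cover the same periods")
    return [l - s for l, s in zip(long_leg, short_leg)]


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=1):
    """Mean excess return over sample standard deviation, annualized.

    Needs at least two periods so the sample standard deviation exists.
    """
    excess = [r - risk_free for r in returns]
    scale = math.sqrt(periods_per_year)
    return statistics.mean(excess) / statistics.stdev(excess) * scale


if __name__ == "__main__":
    # Made-up monthly returns for the long and short legs.
    spread = long_short_returns([0.03, 0.05, 0.01], [0.02, 0.02, 0.015])
    print(sharpe_ratio(spread, periods_per_year=12))
```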
IV · Sep 8, 2023
Systematic Review of Techniques in Brain Image Synthesis using Deep Learning
Shubham Singh, Ammar Ranapurwala, Mrunal Bewoor et al.
This review paper delves into the present state of medical imaging, with a specific focus on the use of deep learning techniques for brain image synthesis. The need for medical image synthesis to improve diagnostic accuracy and decrease invasiveness in medical procedures is emphasized, along with the role of deep learning in enabling these advancements. The paper examines various methods and techniques for brain image synthesis, including 2D to 3D constructions, MRI synthesis, and the use of transformers. It also addresses limitations and challenges faced in these methods, such as obtaining well-curated training data and addressing brain ultrasound issues. The review concludes by exploring the future potential of this field and the opportunities for further advancements in medical imaging using deep learning techniques. The significance of transformers and their potential to revolutionize the medical imaging field is highlighted. Additionally, the paper discusses the potential solutions to the shortcomings and limitations faced in this field. The review provides researchers with an updated reference on the present state of the field and aims to inspire further research and bridge the gap between the present state of medical imaging and the future possibilities offered by deep learning techniques.
IV · Oct 11, 2023
BrainVoxGen: Deep learning framework for synthesis of Ultrasound to MRI
Shubham Singh, Mrunal Bewoor, Ammar Ranapurwala et al.
The work proposes a novel deep-learning framework for the synthesis of three-dimensional MRI volumes from corresponding 3D ultrasound images of the brain, leveraging a modified iteration of the Pix2Pix Generative Adversarial Network (GAN) model. Addressing the formidable challenge of bridging the modality disparity between ultrasound and MRI, this research holds promise for transformative applications in medical diagnostics and treatment planning within the neuroimaging domain. While the findings reveal a discernible degree of similarity between the synthesized MRI volumes and anticipated outcomes, they fall short of practical deployment standards, primarily due to constraints associated with dataset scale and computational resources. The methodology yields MRI volumes with a satisfactory similarity score, establishing a foundational benchmark for subsequent investigations.
CL · Jun 16, 2025
Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations
Abhilekh Borah, Chhavi Sharma, Danush Khanna et al.
Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
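One of the cluster-quality measures named above, the Dunn Index, can be sketched in a few lines: it is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, so larger values mean better-separated, tighter clusters. The code below is a generic implementation of that textbook definition on toy 2-D points standing in for "safe" and "unsafe" activations. It is not the AQI itself, and how AQI combines or pools such indices is not specified here.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Minimum inter-cluster distance over maximum intra-cluster diameter.

    `clusters` is a list of clusters, each a list of equal-length points.
    At least one cluster must contain two or more distinct points.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Closest pair of points drawn from two different clusters.
    min_between = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    # Widest pair of points inside any single cluster.
    max_diameter = max(
        math.dist(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )
    return min_between / max_diameter


if __name__ == "__main__":
    # Toy stand-ins for pooled activations of safe and unsafe prompts.
    safe = [(0.0, 0.0), (0.0, 1.0)]
    unsafe = [(10.0, 0.0), (10.0, 1.0)]
    print(dunn_index([safe, unsafe]))
```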
LG · Jan 16, 2024
Transformer-based approach for Ethereum Price Prediction Using Crosscurrency correlation and Sentiment Analysis
Shubham Singh, Mayur Bhat
The research delves into the capabilities of a transformer-based neural network for Ethereum cryptocurrency price forecasting. The experiment is built around the hypothesis that cryptocurrency prices are strongly correlated with those of other cryptocurrencies and with the sentiment around the cryptocurrency. The model employs a transformer architecture for several setups, from single-feature scenarios to complex configurations incorporating volume, sentiment, and correlated cryptocurrency prices. Despite a smaller dataset and less complex architecture, the transformer model surpasses ANN and MLP counterparts on some parameters. The conclusion presents a hypothesis on the illusion of causality in cryptocurrency price movements driven by sentiments.
CV · Aug 25, 2021
Multi-Attributed and Structured Text-to-Face Synthesis
Rohan Wadhawan, Tanuj Drall, Shubham Singh et al.
Generative Adversarial Networks (GANs) have revolutionized image synthesis through many applications like face generation, photograph editing, and image super-resolution. Image synthesis using GANs has predominantly been uni-modal, with few approaches that can synthesize images from text or other data modes. Text-to-image synthesis, especially text-to-face synthesis, has promising use cases of robust face generation from eyewitness accounts and augmentation of the reading experience with visual cues. However, only a couple of datasets provide consolidated face data and textual descriptions for text-to-face synthesis. Moreover, these textual annotations are less extensive and descriptive, which reduces the diversity of faces generated from them. This paper empirically proves that increasing the number of facial attributes in each textual description helps GANs generate more diverse and real-looking faces. To prove this, we propose a new methodology that focuses on using structured textual descriptions. We also consolidate a Multi-Attributed and Structured Text-to-face (MAST) dataset consisting of high-quality images with structured textual annotations and make it available to researchers to experiment and build upon. Lastly, we report benchmark Fréchet Inception Distance (FID), Facial Semantic Similarity (FSS), and Facial Semantic Distance (FSD) scores for the MAST dataset.
LG · Aug 12, 2021
Fair Decision-Making for Food Inspections
Shubham Singh, Bhuvni Shah, Chris Kanich et al.
Data and algorithms are essential and complementary parts of a large-scale decision-making process. However, their injudicious use can lead to unforeseen consequences, as has been observed by researchers and activists alike in the recent past. In this paper, we revisit the application of predictive models by the Chicago Department of Public Health to schedule restaurant inspections and prioritize the detection of critical food code violations. We perform the first analysis of the model's fairness to the population served by the restaurants in terms of average time to find a critical violation. We find that the model treats inspections unequally based on the sanitarian who conducted the inspection and that, in turn, there are geographic disparities in the benefits of the model. We examine four alternate methods of model training and two alternative ways of scheduling using the model and find that the latter generate more desirable results. The challenges from this application point to important directions for future work around fairness with collective entities rather than individuals, the use of critical violations as a proxy, and the disconnect between fair classification and fairness in the dynamic scheduling system.
HC · Oct 2, 2020
Real-time Collaboration Between Mixed Reality Users in Geo-referenced Virtual Environment
Shubham Singh, Zengou Ma, Daniele Giunchi et al.
Collaboration using mixed reality technology is an active area of research, where significant work is done to virtually bridge physical distances. A diverse set of platforms and devices can be used for mixed-reality collaboration, but existing work is largely focused on indoor scenarios, where stable tracking can be assumed. We focus on supporting collaboration between VR and AR users, where the AR user is mobile outdoors and the VR user is immersed in a true-sized digital twin. This cross-platform solution requires new user experiences for interaction, accurate modelling of the real world, and working with noisy outdoor tracking sensors such as GPS. In this paper, we present our results and observations of real-time collaboration between cross-platform users, in the context of a geo-referenced virtual environment. We propose a solution for using GPS measurements in VSLAM to localize the AR user in an outdoor environment. The client applications enable VR and AR users to collaborate across the heterogeneous platforms seamlessly. Users can place or load dynamic content tagged to a geolocation and share their experience with remote users in real time.
DL · Apr 22, 2020
Visible Insights of the Invisible Pandemic: A Scientometric, Altmetric and Topic Trend Analysis
Sujit Bhattacharya, Shubham Singh
The recent SARS-CoV-2 virus outbreak has created an unprecedented global health crisis. The disease is showing alarming trends: the number of people infected, new cases, and the death rate all highlight the need to control it at the earliest. The strategy now for governments around the globe is to limit the spread of the virus until the research community develops a treatment, drug, or vaccine against it. The outbreak has unsurprisingly led to a huge volume of research on the disease within a short period of time. It has also led to intense social media activity on Twitter, Facebook, dedicated blogs, news reports, and other online sites discussing various aspects of the disease. It becomes a useful and challenging exercise to draw from this huge volume of research the key papers that form the research front, their influence in the research community, and other important research insights. Similarly, it becomes important to discern the key issues that influence society concerning this disease. The paper is motivated by this. It attempts to identify the most influential papers, the key knowledge base, and the major topics in COVID-19 research. Further, it attempts to capture society's perception by discerning key topics that are trending online. The study concludes by highlighting its implications.