CYApr 2
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed RubricsNari Johnson, Deepthi Sudharsan, Hamna et al.
Measurement is essential to improving AI performance and mitigating harms for marginalized groups. As generative AI systems are rapidly deployed across geographies and contexts, AI measurement practices must be designed to support repeatable, automatable application across different models, datasets, and evaluation settings. But the drive to automate measurement can be in tension with the ability for measurement instruments to capture the expertise and perspectives of communities impacted by AI. Recent work advocates for breaking measurement into several key stages: first moving from an abstract concept to be measured into a precise, "systematized" concept; next operationalizing the systematized concept into a concrete measurement instrument; and finally applying the measurement instrument on data to produce measurements. This opens up an opportunity to concentrate community engagement in the systematization phase before operationalizing and applying measurement instruments. In this paper, we explore how to involve communities in systematizing the concept of "cultural appropriateness" in text-to-image models' representation of culturally significant artifacts through case studies with three communities: blind and low vision individuals residing in the UK, residents of Kerala, and residents of Tamil Nadu. Our systematized concepts reflect community members' lived experiences interacting with each artifact and how they want their material culture to be depicted, demonstrating the value of community involvement in defining valid measures. We explore how these systematized concepts can be operationalized into automated measurement instruments that could be applied using a multimodal LLM-as-a-judge approach and challenges that remain. We reflect on the benefits and limitations of such approaches.
CLOct 25, 2024
KAHANI: Culturally-Nuanced Visual Storytelling Tool for Non-Western CulturesHamna, Deepthi Sudharsan, Agrima Seth et al.
Large Language Models (LLMs) and Text-To-Image (T2I) models have demonstrated the ability to generate compelling text and visual stories. However, their outputs are predominantly aligned with the sensibilities of the Global North, often resulting in an outsider's gaze on other cultures. As a result, non-Western communities have to put extra effort into generating culturally specific stories. To address this challenge, we developed a visual storytelling tool called Kahani that generates culturally grounded visual stories for non-Western cultures. Our tool leverages off-the-shelf models GPT-4 Turbo and Stable Diffusion XL (SDXL). By using Chain of Thought (CoT) and T2I prompting techniques, we capture the cultural context from user's prompt and generate vivid descriptions of the characters and scene compositions. To evaluate the effectiveness of Kahani, we conducted a comparative user study with ChatGPT-4 (with DALL-E3) in which participants from different regions of India compared the cultural relevance of stories generated by the two tools. The results of the qualitative and quantitative analysis performed in the user study show that Kahani's visual stories are more culturally nuanced than those generated by ChatGPT-4. In 27 out of 36 comparisons, Kahani outperformed or was on par with ChatGPT-4, effectively capturing cultural nuances and incorporating more Culturally Specific Items (CSI), validating its ability to generate culturally grounded visual stories.
CLSep 29, 2025
Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot SettingsHamna, Gayatri Bhat, Sourabrata Mukherjee et al.
Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.