CR | Mar 1, 2024 | Code
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
Tanmay Rajore, Nishanth Chandran, Sunayana Sitaram et al.
Benchmarking is the de facto standard for evaluating LLMs, due to its speed, replicability and low cost. However, recent work has pointed out that the majority of the open-source benchmarks available today have been contaminated or leaked into LLMs, meaning that LLMs have access to test data during pretraining and/or fine-tuning. This raises serious concerns about the validity of benchmarking studies conducted so far and the future of evaluation using benchmarks. To solve this problem, we propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model. We describe various scenarios (depending on the trust placed in model owners or dataset owners), and present solutions to avoid data contamination using private benchmarking. For scenarios where the model weights need to be kept private, we describe solutions from confidential computing and cryptography that can aid in private benchmarking. We build an end-to-end system, TRUCE, that enables such private benchmarking, showing that the overheads introduced to protect models and benchmarks are negligible (in the case of confidential computing) and tractable (when cryptographic security is required). Finally, we also discuss solutions to the problem of benchmark dataset auditing, to ensure that private benchmarks are of sufficiently high quality.
CL | Jun 21, 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts, Varun Gumma, Aditya Yadavalli et al.
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data, and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyze the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but the agreement drops for direct assessment evaluation, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
HC | Jan 8, 2021
Adaptive Accessible AR/VR Systems
Pradipta Biswas, Pilar Orero, Manohar Swaminathan et al.
Augmented, virtual and mixed reality technologies offer new ways of interacting with digital media. However, such technologies are not well explored for people with different ranges of abilities beyond a few specific navigation and gaming applications. While new standardization activities are investigating accessibility issues with existing AR/VR systems, commercial systems are still confined to specialized hardware and software, limiting their widespread adoption among people with disabilities as well as seniors. This proposal takes a novel approach by exploring the application of user model-based personalization for AR/VR systems to improve accessibility. The workshop will be organized by experienced researchers in the fields of human-computer interaction, robotics control, assistive technology, and AR/VR systems, and will consist of peer-reviewed papers and hands-on demonstrations. Keynote speeches and demonstrations will cover the latest accessibility research at Microsoft, Google, Verizon and leading universities.