Roland Daynauth

CL
h-index 39
6 papers
107 citations
Novelty 40%
AI Score 36

6 Papers

SE | Dec 20, 2023 | Code
Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI's LLM with Open Source SLMs in Production

Chandra Irugalbandara, Ashish Mahendra, Roland Daynauth et al.

Many companies use large language models (LLMs) offered as a service, like OpenAI's GPT-4, to create AI-enabled product experiences. Along with the benefits of ease-of-use and shortened time-to-solution, this reliance on proprietary services has downsides in model control, performance reliability, uptime predictability, and cost. At the same time, a flurry of open-source small language models (SLMs) has been made available for commercial use. However, their readiness to replace existing capabilities remains unclear, and a systematic approach to holistically evaluate these SLMs is not readily available. This paper presents a systematic evaluation methodology and a characterization of modern open-source SLMs and their trade-offs when replacing proprietary LLMs for a real-world product feature. We have designed SLaM, an open-source automated analysis tool that enables the quantitative and qualitative testing of product features utilizing arbitrary SLMs. Using SLaM, we examine the quality and performance characteristics of modern SLMs relative to an existing customer-facing implementation using the OpenAI GPT-4 API. Across 9 SLMs and their 29 variants, we observe that SLMs provide competitive results, significant performance consistency improvements, and a cost reduction of 5x to 29x when compared to GPT-4.

CL | Jul 5, 2024
Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Roland Daynauth, Jason Mars

The SLaM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLaM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Our findings show that recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases. For instance, Spearman's rank correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the importance of accounting for biases in automated evaluations to ensure fair and accurate model assessments. The recalibration process enhances the reliability of automated evaluators, leading to better AI models that align with human values and expectations. This study provides a robust methodology for future research into bias correction and emphasizes the feasibility and benefits of developing human-aligned AI evaluation systems.
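The abstract above measures evaluator-to-human agreement with Spearman's rank correlation. As a minimal illustration of that metric only (not the paper's recalibration procedure), the sketch below computes it in plain Python. The function names `average_ranks` and `spearman` and the sample scores are hypothetical, chosen for this example.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scores for four models (illustrative values, not from the paper).
human_scores = [4.0, 3.0, 2.0, 1.0]
evaluator_scores = [0.9, 0.7, 0.4, 0.1]
rho = spearman(human_scores, evaluator_scores)
```

A value near 1 means the evaluator orders the models the same way humans do, and a negative value means the orderings tend to be reversed.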

CL | Nov 19, 2024
Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat

Roland Daynauth, Christopher Clarke, Krisztian Flautner et al.

Deciding which large language model (LLM) to use is a complex challenge. Pairwise ranking has emerged as a new method for evaluating human preferences for LLMs. This approach entails humans evaluating pairs of model outputs based on a predefined criterion. By collecting these comparisons, a ranking can be constructed using methods such as Elo. However, applying these algorithms as constructed in the context of LLM evaluation introduces several challenges. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct a series of extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
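To make the pairwise-ranking idea in the abstract above concrete, here is a minimal sketch of the standard Elo update applied to a list of head-to-head outcomes. It is not the paper's evaluation code. The K-factor of 32, the starting rating of 1000, and the names `expected_score` and `elo_rank` are assumptions made for this example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(comparisons, k=32.0, initial=1000.0):
    """Process (winner, loser) pairs in order and return (ranking, ratings).

    k and initial are assumed defaults, not values taken from the paper.
    """
    ratings = {}
    for winner, loser in comparisons:
        ra = ratings.setdefault(winner, initial)
        rb = ratings.setdefault(loser, initial)
        ea = expected_score(ra, rb)
        # The winner gains exactly what the loser gives up (zero-sum update).
        delta = k * (1.0 - ea)
        ratings[winner] = ra + delta
        ratings[loser] = rb - delta
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    return ranking, ratings


# Hypothetical human judgments: each tuple is (preferred model, other model).
judgments = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ranking, ratings = elo_rank(judgments)
```

Because each update depends on the ratings at that moment, processing the same comparisons in a different order can produce different final ratings.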

CL | May 21, 2025
SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Roland Daynauth, Christopher Clarke, Krisztian Flautner et al.

The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that state-of-the-art calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and a public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.

CL | May 6, 2024
Guylingo: The Republic of Guyana Creole Corpora

Christopher Clarke, Roland Daynauth, Charlene Wilkinson et al.

While major languages often enjoy substantial attention and resources, the linguistic diversity across the globe encompasses a multitude of smaller, indigenous, and regional languages that lack the same level of computational support. One such region is the Caribbean. While commonly labeled as "English speaking", the ex-British Caribbean region consists of a myriad of Creole languages thriving alongside English. In this paper, we present Guylingo: a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole), the most widely spoken language in the culturally rich nation of Guyana. We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole. Lastly, we discuss the unique opportunities presented by recent NLP advancements for accelerating the formal adoption of Creole languages as official languages in the Caribbean.

CL | May 17, 2023
The Jaseci Programming Paradigm and Runtime Stack: Building Scale-out Production Applications Easy and Fast

Jason Mars, Yiping Kang, Roland Daynauth et al.

Today's production scale-out applications include many sub-application components, such as storage backends, logging infrastructure, and AI models. These components have drastically different characteristics, are required to work in collaboration, and interface with each other as microservices. This leads to increasingly high complexity in developing, optimizing, configuring, and deploying scale-out applications, raising the barrier to entry for most individuals and small teams. We developed a novel co-designed runtime system, Jaseci, and programming language, Jac, which aim to reduce this complexity. The key design principle throughout Jaseci's design is to raise the level of abstraction by moving as much of the scale-out data management, microservice componentization, and live update complexity as possible into the runtime stack, where it can be automated and optimized. We use real-world AI applications to demonstrate Jaseci's benefit for application performance and developer productivity.