Stavros Zervoudakis

AI
h-index 7
4 papers
34 citations
Novelty 33%
AI Score 38

4 Papers

AI · Apr 15
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.

Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24GB GPU, we evaluate Qwen3-8B on the AppWorld benchmark under both full-precision and 4-bit quantized configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8% to 26.3% FP16; 5.3% to 14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4 times their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.
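The three-role pipeline described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable type, the prompt strings, and the function names `scaffolded_step` and `make_recording_model` are all assumptions made for the example, since the abstract does not give the actual prompts or interfaces. The sketch shows only the structural idea: one frozen model is invoked three times with different conditioning, and the correction call receives the draft code alone, without the conversation history.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
Model = Callable[[str], str]

# Illustrative role prompts (assumed; the abstract does not give the real ones).
SUMMARIZE = "Compress this history. Keep tokens, credentials, and API responses verbatim:\n"
ACT = "You are the agent. Given this context, write code for the next step:\n"
CORRECT = "Review this code in isolation and return a corrected version:\n"


def scaffolded_step(model: Model, history: List[str], task: str) -> str:
    """Invoke one frozen model in three roles and return the revised code."""
    # Role 1: the summarizer compresses the dialogue history.
    summary = model(SUMMARIZE + "\n".join(history))
    # Role 2: the main agent reasons over the compressed context.
    draft = model(ACT + f"Task: {task}\nContext: {summary}")
    # Role 3: the corrector sees only the draft code, never the history.
    return model(CORRECT + draft)


def make_recording_model(log: List[str]) -> Model:
    """Build a stub model that records each prompt it receives (for demonstration)."""
    def model(prompt: str) -> str:
        log.append(prompt)
        return f"<out{len(log)}>"
    return model


calls: List[str] = []
result = scaffolded_step(make_recording_model(calls), ["login token=abc123"], "pay bill")
```

With the stub model, `calls` holds three prompts. The first contains the raw history and the third contains only the draft output, which mirrors the isolation of the correction role described above.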

AI · Aug 5, 2025
Hidden Dynamics of Massive Activations in Transformer Training

Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli et al.

Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.

CR · Feb 26, 2021
Exploring the Effect of Resolution on the Usability of Locimetric Authentication

Antonios Saravanos, Dongnanzi Zheng, Stavros Zervoudakis et al.

Locimetric authentication is a form of graphical authentication in which users validate their identity by selecting predetermined points on a predetermined image. Its primary advantage over the ubiquitous text-based approach stems from users' superior ability to remember visual information over textual information, coupled with the authentication process being transformed to one requiring recognition (instead of recall). Ideally, these differentiations enable users to create more complex passwords, which theoretically are more secure. Yet locimetric authentication has one significant weakness: hot-spots. This term refers to areas of an image that users gravitate towards, and which consequently have a higher probability of being selected. Although many strategies have been proposed to counter the hot-spot problem, one area that has received little attention is that of resolution. The hypothesis here is that high-resolution images would afford the user a larger password space, and consequently any hot-spots would dissipate. We employ an experimental approach, where users generate a series of locimetric passwords on either low- or high-resolution images. Our research reveals the presence of hot-spots even in high-resolution images, albeit at a lower level than that exhibited with low-resolution images. We conclude by reinforcing that other techniques - such as existing or new software controls or training - need to be utilized to mitigate the emergence of hot-spots with the locimetric scheme.

HC · Jan 12, 2021
The Hidden Cost of Using Amazon Mechanical Turk for Research

Antonios Saravanos, Stavros Zervoudakis, Dongnanzi Zheng et al.

In this study, we investigate the attentiveness exhibited by participants sourced through Amazon Mechanical Turk (MTurk), thereby discovering a significant level of inattentiveness amongst the platform's top crowd workers (those classified as 'Master', with an 'Approval Rate' of 98% or more, and a 'Number of HITS approved' value of 1,000 or more). A total of 564 individuals from the United States participated in our experiment. They were asked to read a vignette outlining one of four hypothetical technology products and then complete a related survey. Three forms of attention check (logic, honesty, and time) were used to assess attentiveness. Through this experiment we determined that a total of 126 (22.3%) participants failed at least one of the three forms of attention check, with most (94) failing the honesty check - followed by the logic check (31), and the time check (27). Thus, we established that significant levels of inattentiveness exist even among the most elite MTurk workers. The study concludes by reaffirming the need for multiple forms of carefully crafted attention checks, irrespective of whether participant quality is presumed to be high according to MTurk criteria such as 'Master', 'Approval Rate', and 'Number of HITS approved'. Furthermore, we propose that researchers adjust their proposals to account for the effort and costs required to address participant inattentiveness.
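The figures reported in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable names are illustrative. It reproduces the 22.3% failure rate and shows that the per-check failures sum to more than the number of failing participants, which implies that some participants failed more than one check.

```python
# Counts taken directly from the abstract.
participants = 564
failed_any = 126
failed_by_check = {"honesty": 94, "logic": 31, "time": 27}

# Share of participants who failed at least one attention check.
failure_rate = failed_any / participants

# Total individual check failures; exceeds failed_any when people fail several checks.
total_check_failures = sum(failed_by_check.values())
extra_failures = total_check_failures - failed_any

print(f"Failure rate: {failure_rate:.1%}")
print(f"Check failures: {total_check_failures}, beyond one per person: {extra_failures}")
```

The abstract does not say how the 26 additional failures are distributed among participants, so the sketch stops at the aggregate.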