Stavros Zervoudakis

h-index2

3papers

29citations

Novelty25%

AI Score33

Ranked #121,371 of 194,257 authors (top 62%)#7,439 in AI (top 59%)

3 Papers

11.4AIApr 15

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.

Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24GB GPU, we evaluate Qwen3-8B on the AppWorld benchmark under both full-precision and 4-bit quantized configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8% to 26.3% FP16; 5.3% to 14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4 times their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.

3.8CRFeb 26, 2021

Exploring the Effect of Resolution on the Usability of Locimetric Authentication

Antonios Saravanos, Dongnanzi Zheng, Stavros Zervoudakis et al.

Locimetric authentication is a form of graphical authentication in which users validate their identity by selecting predetermined points on a predetermined image. Its primary advantage over the ubiquitous text-based approach stems from users' superior ability to remember visual information over textual information, coupled with the authentication process being transformed to one requiring recognition (instead of recall). Ideally, these differentiations enable users to create more complex passwords, which theoretically are more secure. Yet locimetric authentication has one significant weakness: hot-spots. This term refers to areas of an image that users gravitate towards, and which consequently have a higher probability of being selected. Although many strategies have been proposed to counter the hot-spot problem, one area that has received little attention is that of resolution. The hypothesis here is that high-resolution images would afford the user a larger password space, and consequently any hot-spots would dissipate. We employ an experimental approach, where users generate a series of locimetric passwords on either low- or high-resolution images. Our research reveals the presence of hot-spots even in high-resolution images, albeit at a lower level than that exhibited with low-resolution images. We conclude by reinforcing that other techniques - such as existing or new software controls or training - need to be utilized to mitigate the emergence of hot-spots with the locimetric scheme.

8.6HCJan 12, 2021

The Hidden Cost of Using Amazon Mechanical Turk for Research

Antonios Saravanos, Stavros Zervoudakis, Dongnanzi Zheng et al.

In this study, we investigate the attentiveness exhibited by participants sourced through Amazon Mechanical Turk (MTurk), thereby discovering a significant level of inattentiveness amongst the platform's top crowd workers (those classified as 'Master', with an 'Approval Rate' of 98% or more, and a 'Number of HITS approved' value of 1,000 or more). A total of 564 individuals from the United States participated in our experiment. They were asked to read a vignette outlining one of four hypothetical technology products and then complete a related survey. Three forms of attention check (logic, honesty, and time) were used to assess attentiveness. Through this experiment we determined that a total of 126 (22.3%) participants failed at least one of the three forms of attention check, with most (94) failing the honesty check - followed by the logic check (31), and the time check (27). Thus, we established that significant levels of inattentiveness exist even among the most elite MTurk workers. The study concludes by reaffirming the need for multiple forms of carefully crafted attention checks, irrespective of whether participant quality is presumed to be high according to MTurk criteria such as 'Master', 'Approval Rate', and 'Number of HITS approved'. Furthermore, we propose that researchers adjust their proposals to account for the effort and costs required to address participant inattentiveness.