Panos Ipeirotis

HC
h-index: 5
6 papers
9 citations
Novelty: 40%
AI Score: 38

6 Papers

CY · Mar 18
Scalable and Personalized Oral Assessments Using Voice AI

Panos Ipeirotis, Konstantinos Rizakos

Large language models have broken take-home exams. Students generate polished work they cannot explain under follow-up questioning. Oral examinations are a natural countermeasure -- they require real-time reasoning and cannot be outsourced to an LLM -- but they have never scaled. Voice AI changes this. We describe a system that conducted 36 oral examinations for an undergraduate AI/ML course at a total cost of \$15 (\$0.42 per student), low enough to attach oral comprehension checks to every assignment rather than reserving them for high-stakes finals. Because the LLM generates questions dynamically from a rubric, the entire examination structure can be shared in advance: practice is learning, and there is no exam to leak. A multi-agent architecture decomposes each examination into structured phases, and a council of three LLM families grades each transcript through a deliberation round in which models revise scores after reviewing peer evidence, achieving inter-rater reliability (Krippendorff's $α$ = 0.86) above conventional thresholds. But the system also broke in instructive ways: the agent stacked questions despite explicit prohibitions, could not randomize case selection, and a cloned professorial voice was perceived as aggressive rather than familiar. The recurring lesson is that behavioral constraints on LLMs must be enforced through architecture, not prompting alone. Students largely agreed the format tested genuine understanding (70%), yet found it more stressful than written exams (83%) -- unsurprising given that 83% had never taken any oral examination. We document the full design, failure modes, and student experience, and include all prompts as appendices.
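The lesson that constraints belong in the architecture rather than the prompt can be illustrated with a minimal sketch. It assumes a hypothetical pipeline in which every agent turn passes through ordinary code before reaching the student; the function names and the question-mark heuristic are illustrative assumptions, not the authors' implementation.

```python
import random

def pick_case(cases, rng):
    """Choose the exam case in code instead of asking the LLM to randomize.

    Hypothetical helper: the abstract reports that the agent could not
    randomize case selection on its own.
    """
    return rng.choice(cases)

def enforce_single_question(turn):
    """Truncate an agent turn after its first question mark.

    A crude, assumed guard against stacked questions: whatever the prompt
    says, at most one question reaches the student.
    """
    idx = turn.find("?")
    if idx == -1:
        return turn  # no question in this turn, pass it through unchanged
    return turn[: idx + 1]

if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    cases = ["case A", "case B", "case C"]  # placeholder case names
    print(pick_case(cases, rng))
    print(enforce_single_question("Why did you pick k-means? And how would you tune k?"))
```

A production guard would more likely regenerate the turn than truncate it; the point of the sketch is only that the check runs outside the model.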

HC · Mar 2
Learning to Pay Attention: Unsupervised Modeling of Attentive and Inattentive Respondents in Survey Data

Ilias Triantafyllopoulos, Panos Ipeirotis

The integrity of behavioral and social-science surveys depends on detecting inattentive respondents who provide random or low-effort answers. Traditional safeguards, such as attention checks, are often costly, reactive, and inconsistent. We propose a unified, label-free framework for inattentiveness detection that scores response coherence using complementary unsupervised views: geometric reconstruction (Autoencoders) and probabilistic dependency modeling (Chow-Liu trees). While we introduce a "Percentile Loss" objective to improve Autoencoder robustness against anomalies, our primary contribution is identifying the structural conditions that enable unsupervised quality control. Across nine heterogeneous real-world datasets, we find that detection effectiveness is driven less by model complexity than by survey structure: instruments with coherent, overlapping item batteries exhibit strong covariance patterns that allow even linear models to reliably separate attentive from inattentive respondents. This reveals a critical "Psychometric-ML Alignment": the same design principles that maximize measurement reliability (e.g., internal consistency) also maximize algorithmic detectability. The framework provides survey platforms with a scalable, domain-agnostic diagnostic tool that links data quality directly to instrument design, enabling auditing without additional respondent burden.
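The claim that even linear models can separate attentive from inattentive respondents when items covary strongly can be illustrated with a toy example. The sketch below is a stand-in, not the paper's autoencoder or Chow-Liu method: it assumes a single coherent battery of 1-5 Likert items, simulates the data, and scores each respondent by how badly each answer is predicted from the mean of that respondent's other answers. All names and parameters are illustrative assumptions.

```python
import random
from statistics import mean

def incoherence(responses):
    """Mean squared error when each item is predicted from the mean of the
    respondent's other items (a simple linear leave-one-out reconstruction)."""
    n = len(responses)
    total = sum(responses)
    return mean([(r - (total - r) / (n - 1)) ** 2 for r in responses])

def simulate_survey(n_attentive, n_inattentive, n_items=10, seed=0):
    """Simulated data (an assumption): attentive answers follow a latent
    trait plus small noise; inattentive answers are uniform at random."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_attentive):
        trait = rng.uniform(1, 5)
        answers = [min(5, max(1, round(trait + rng.gauss(0, 0.5))))
                   for _ in range(n_items)]
        rows.append((answers, True))
    for _ in range(n_inattentive):
        rows.append(([rng.randint(1, 5) for _ in range(n_items)], False))
    return rows

if __name__ == "__main__":
    rows = simulate_survey(200, 50)
    attentive = mean([incoherence(r) for r, ok in rows if ok])
    inattentive = mean([incoherence(r) for r, ok in rows if not ok])
    print(f"mean incoherence, attentive:   {attentive:.2f}")
    print(f"mean incoherence, inattentive: {inattentive:.2f}")
```

The separation in this toy setting comes entirely from the shared latent trait; a battery of unrelated items would give the score nothing to work with, which is the structural dependence the abstract describes.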

LG · May 20, 2025
Algorithmic Hiring and Diversity: Reducing Human-Algorithm Similarity for Better Outcomes

Prasanna Parasurama, Panos Ipeirotis

Algorithmic tools are increasingly used in hiring to improve fairness and diversity, often by enforcing constraints such as gender-balanced candidate shortlists. However, we show theoretically and empirically that enforcing equal representation at the shortlist stage does not necessarily translate into more diverse final hires, even when there is no gender bias in the hiring stage. We identify a crucial factor influencing this outcome: the correlation between the algorithm's screening criteria and the human hiring manager's evaluation criteria -- higher correlation leads to lower diversity in final hires. Using a large-scale empirical analysis of nearly 800,000 job applications across multiple technology firms, we find that enforcing equal shortlists yields limited improvements in hire diversity when the algorithmic screening closely mirrors the hiring manager's preferences. We propose a complementary algorithmic approach designed explicitly to diversify shortlists by selecting candidates likely to be overlooked by managers, yet still competitive according to their evaluation criteria. Empirical simulations show that this approach significantly enhances gender diversity in final hires without substantially compromising hire quality. These findings highlight the importance of algorithmic design choices in achieving organizational diversity goals and provide actionable guidance for practitioners implementing fairness-oriented hiring algorithms.
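The role of human-algorithm correlation can be illustrated with a toy simulation. It is a sketch under assumptions that are not taken from the paper: women are a minority of the applicant pool, both groups draw scores from the same distribution, the shortlist takes the top candidates of each group by algorithm score, and an unbiased manager hires the shortlisted candidate with the highest manager score. The parameter `rho` is the assumed correlation between the two scores.

```python
import random

def share_of_women_hired(rho, trials=2000, n_women=20, n_men=80,
                         per_group=5, seed=0):
    """Fraction of simulated hires who are women under a gender-balanced
    shortlist, for a given algorithm-manager score correlation rho."""
    rng = random.Random(seed)
    noise_weight = (1 - rho ** 2) ** 0.5
    women_hired = 0
    for _ in range(trials):
        pool = []
        for i in range(n_women + n_men):
            is_woman = i < n_women
            algo = rng.gauss(0, 1)
            # Manager score correlates with the algorithm score at level rho
            # and does not depend on gender (no bias at the hiring stage).
            manager = rho * algo + noise_weight * rng.gauss(0, 1)
            pool.append((algo, manager, is_woman))
        # Balanced shortlist: top `per_group` of each group by algorithm score.
        women = sorted((c for c in pool if c[2]), reverse=True)[:per_group]
        men = sorted((c for c in pool if not c[2]), reverse=True)[:per_group]
        hire = max(women + men, key=lambda c: c[1])
        women_hired += hire[2]
    return women_hired / trials

if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.95):
        print(rho, share_of_women_hired(rho))
```

In this setup the shortlisted women come from a smaller pool and so tend to have lower algorithm scores than the shortlisted men; the more the manager's criteria mirror the algorithm's, the more that gap carries over to the final hire, even though the shortlist itself is balanced.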

AP · Nov 11, 2021
Full Characterization of Adaptively Strong Majority Voting in Crowdsourcing

Margarita Boyarskaya, Panos Ipeirotis

In crowdsourcing, quality control is commonly achieved by having workers examine items and vote on their correctness. To minimize the impact of unreliable worker responses, a $δ$-margin voting process is used, where additional votes are solicited until a predetermined threshold $δ$ for agreement between workers is exceeded. The process is widely adopted but only as a heuristic. Our research presents a modeling approach using absorbing Markov chains to analyze the characteristics of this voting process that matter in crowdsourced processes. We provide closed-form equations for the quality of the resulting consensus vote, the expected number of votes required for consensus, the variance of vote requirements, and other distribution moments. Our findings demonstrate how the threshold $δ$ can be adjusted to achieve quality equivalence across voting processes that employ workers with varying accuracy levels. We also provide efficiency-equalizing payment rates for voting processes with different expected response accuracy levels. Additionally, our model considers items with varying degrees of difficulty and uncertainty about the difficulty of each example. Our simulations, using real-world crowdsourced vote data, validate the effectiveness of our theoretical model in characterizing the consensus aggregation process. The results of our study can be effectively employed in practical crowdsourcing applications.
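The voting process can be sketched as a random walk on the vote margin with absorbing states at $+δ$ and $-δ$. The example below assumes the simplest setting, which may differ from the paper's general model: binary items, independent workers who are each correct with the same probability `p`, and voting that stops as soon as the margin reaches `delta`. Under those assumptions the standard gambler's-ruin formulas give the consensus quality and the expected number of votes, and a Monte Carlo run checks them.

```python
import random

def consensus_quality(p, delta):
    """Probability the margin reaches +delta (correct) before -delta."""
    q = 1.0 - p
    return p ** delta / (p ** delta + q ** delta)

def expected_votes(p, delta):
    """Expected number of votes until the margin first reaches +/-delta."""
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        return float(delta ** 2)  # symmetric random walk
    return delta * (p ** delta - q ** delta) / ((p - q) * (p ** delta + q ** delta))

def simulate(p, delta, trials=20000, seed=0):
    """Monte Carlo estimate of (consensus quality, mean votes per item)."""
    rng = random.Random(seed)
    correct = 0
    votes = 0
    for _ in range(trials):
        margin = 0
        while abs(margin) < delta:
            margin += 1 if rng.random() < p else -1
            votes += 1
        correct += margin > 0
    return correct / trials, votes / trials

if __name__ == "__main__":
    for p, delta in [(0.8, 2), (0.7, 3)]:
        print(p, delta, consensus_quality(p, delta), expected_votes(p, delta),
              simulate(p, delta))
```

Under these assumptions, workers with accuracy 0.7 and $δ = 3$ reach a consensus quality of about 0.93, close to the roughly 0.94 obtained from workers with accuracy 0.8 and $δ = 2$, which illustrates how the threshold can be raised to compensate for less accurate workers at the cost of more votes.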

HC · Feb 24, 2020
What do crowd workers think about creative work?

Jonas Oppenlaender, Aku Visuri, Kristy Milland et al.

Crowdsourcing platforms are a powerful and convenient means for recruiting participants in online studies and collecting data from the crowd. As more information work is automated by machine learning algorithms, creativity -- that is, a human's ability for divergent and convergent thinking -- will play an increasingly important role on online crowdsourcing platforms. However, we lack insights into what crowd workers think about creative work. In studies in Human-Computer Interaction (HCI), the ability and willingness of the crowd to participate in creative work seems to be largely unquestioned. Insights into the workers' perspective are rare, but important, as they may inform the design of studies with higher validity. Given that creativity will play an increasingly important role in crowdsourcing, it is imperative to develop an understanding of how workers perceive creative work. In this paper, we summarize our recent worker-centered study of creative work on two general-purpose crowdsourcing platforms (Amazon Mechanical Turk and Prolific). Our study illuminates what creative work is like for crowd workers on these two crowdsourcing platforms. The work identifies several worker archetypes with different attitudes towards creative work, and discusses common pitfalls with creative work on crowdsourcing platforms.

HC · Jan 19, 2020
Creativity on Paid Crowdsourcing Platforms

Jonas Oppenlaender, Kristy Milland, Aku Visuri et al.

General-purpose crowdsourcing platforms are increasingly being harnessed for creative work. The platforms' potential for creative work is clearly identified, but the workers' perspectives on such work have not been extensively documented. In this paper, we uncover what the workers have to say about creative work on paid crowdsourcing platforms. A quantitative and qualitative analysis of a questionnaire launched on two different crowdsourcing platforms reveals clear differences between the workers on the platforms in both preferences and prior experience with creative work. We identify common pitfalls with creative work on crowdsourcing platforms, provide recommendations for requesters of creative work, and discuss the meaning of our findings within the broader scope of creativity-oriented research. To the best of our knowledge, we contribute the first extensive worker-oriented study of creative work on paid crowdsourcing platforms.