Joseph Jay Williams

HC
h-index11
22papers
376citations
Novelty35%
AI Score45

22 Papers

77.3HCMay 27
Designing for the Moment: How One-Minute Interventions Fit or Falter Across Domains

Zahra Hassanzadeh, Anne Hsu, Rachel Kornfield et al.

This paper explores the design space for one-minute digital interventions that prompt immediate action without onboarding or sensing. By embracing Fogg's Behavior Model and four design principles informed by literature, the goal of these interventions was to provide triggers that encourage actions so simple that even people with low motivation would be willing to complete them. We examined the utility of these prompts by conducting a 14-day study with 22 participants interested in making small lifestyle improvements in at least one of three domains: physical activity, healthy eating, and mental well-being. When combined with insights drawn from participants' rewrites of our prompts, our findings suggest that intentional personalization through co-authorship could be a lightweight personalization mechanism that balances relevance with low friction.

EMNov 22, 2022
Contextual Bandits in a Survey Experiment on Charitable Giving: Within-Experiment Outcomes versus Policy Learning

Susan Athey, Undral Byambadalai, Vitor Hadad et al.

We design and implement an adaptive experiment (a ``contextual bandit'') to learn a targeted treatment assignment policy, where the goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation. The design balances two competing objectives: optimizing the outcomes for the subjects in the experiment (``cumulative regret minimization'') and gathering data that will be most useful for policy learning, that is, for learning an assignment rule that will maximize welfare if used after the experiment (``simple regret minimization''). We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study. Next, we implement our selected algorithm. Finally, we perform a second simulation study anchored to the collected data that evaluates the benefits of the algorithm we chose. Our first result is that the value of a learned policy in this setting is higher when data is collected via a uniform randomization rather than collected adaptively using standard cumulative regret minimization or policy learning algorithms. We propose a simple heuristic for adaptive experimentation that improves upon uniform randomization from the perspective of policy learning at the expense of increasing cumulative regret relative to alternative bandit algorithms. The heuristic modifies an existing contextual bandit algorithm by (i) imposing a lower bound on assignment probabilities that decay slowly so that no arm is discarded too quickly, and (ii) after adaptively collecting data, restricting policy learning to select from arms where sufficient data has been gathered.

MLMar 4, 2022
Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions

Nina Deliu, Joseph Jay Williams, Bibhas Chakraborty

In recent years, reinforcement learning (RL) has acquired a prominent position in health-related sequential decision-making problems, gaining traction as a valuable tool for delivering adaptive interventions (AIs). However, in part due to a poor synergy between the methodological and the applied communities, its real-life application is still limited and its potential is still to be realized. To address this gap, our work provides the first unified technical survey on RL methods, complemented with case studies, for constructing various types of AIs in healthcare. In particular, using the common methodological umbrella of RL, we bridge two seemingly different AI domains, dynamic treatment regimes and just-in-time adaptive interventions in mobile health, highlighting similarities and differences between them and discussing the implications of using RL. Open problems and considerations for future research directions are outlined. Finally, we leverage our experience in designing case studies in both areas to showcase the significant collaborative opportunities between statistical, RL, and healthcare researchers in advancing AIs.

AISep 6, 2023
Getting too personal(ized): The importance of feature choice in online adaptive algorithms

ZhaoBin Li, Luna Yee, Nathaniel Sauerberg et al.

Digital educational technologies offer the potential to customize students' experiences and learn what works for which students, enhancing the technology as more students interact with it. We consider whether and when attempting to discover how to personalize has a cost, such as if the adaptation to personal information can delay the adoption of policies that benefit all students. We explore these issues in the context of using multi-armed bandit (MAB) algorithms to learn a policy for what version of an educational technology to present to each student, varying the relation between student characteristics and outcomes and also whether the algorithm is aware of these characteristics. Through simulations, we demonstrate that the inclusion of student characteristics for personalization can be beneficial when those characteristics are needed to learn the optimal action. In other scenarios, this inclusion decreases performance of the bandit algorithm. Moreover, including unneeded student characteristics can systematically disadvantage students with less common values for these characteristics. Our simulations do however suggest that real-time personalization will be helpful in particular real-world scenarios, and we illustrate this through case studies using existing experimental results in ASSISTments. Overall, our simulations show that adaptive personalization in educational technologies can be a double-edged sword: real-time adaptation improves student experiences in some contexts, but the slower adaptation and potentially discriminatory results mean that a more personalized model is not always beneficial.

LGAug 10, 2022
Using Adaptive Experiments to Rapidly Help Students

Angela Zavaleta-Bernuy, Qi Yin Zheng, Hammad Shaikh et al.

Adaptive experiments can increase the chance that current students obtain better outcomes from a field experiment of an instructional intervention. In such experiments, the probability of assigning students to conditions changes while more data is being collected, so students can be assigned to interventions that are likely to perform better. Digital educational environments lower the barrier to conducting such adaptive experiments, but they are rarely applied in education. One reason might be that researchers have access to few real-world case studies that illustrate the advantages and disadvantages of these experiments in a specific context. We evaluate the effect of homework email reminders in students by conducting an adaptive experiment using the Thompson Sampling algorithm and compare it to a traditional uniform random experiment. We present this as a case study on how to conduct such experiments, and we raise a range of open questions about the conditions under which adaptive randomized experiments may be more or less useful.

HCOct 13, 2023
Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception

Harsh Kumar, Ilya Musabirov, Mohi Reza et al.

Personalized chatbot-based teaching assistants can be crucial in addressing increasing classroom sizes, especially where direct teacher presence is limited. Large language models (LLMs) offer a promising avenue, with increasing research exploring their educational utility. However, the challenge lies not only in establishing the efficacy of LLMs but also in discerning the nuances of interaction between learners and these models, which impact learners' engagement and results. We conducted a formative study in an undergraduate computer science classroom (N=145) and a controlled experiment on Prolific (N=356) to explore the impact of four pedagogically informed guidance strategies on the learners' performance, confidence and trust in LLMs. Direct LLM answers marginally improved performance, while refining student solutions fostered trust. Structured guidance reduced random queries as well as instances of students copy-pasting assignment questions to the LLM. Our work highlights the role that teachers can play in shaping LLM-supported learning environments.

AIOct 13, 2023
Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health

Harsh Kumar, Tong Li, Jiakai Shi et al.

Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, can lead to continuous improvement and personalization. However, it remains unclear when these algorithms can simultaneously increase user experience rewards and facilitate appropriate data collection for social-behavioral scientists to analyze with sufficient statistical confidence. Although a growing body of research addresses the practical and statistical aspects of MAB and other adaptive algorithms, further exploration is needed to assess their impact across diverse real-world contexts. This paper presents a software system developed over two years that allows text-messaging intervention components to be adapted using bandit and other algorithms while collecting data for side-by-side comparison with traditional uniform random non-adaptive experiments. We evaluate the system by deploying a text-message-based DMH intervention to 1100 users, recruited through a large mental health non-profit organization, and share the path forward for deploying this system at scale. This system not only enables applications in mental health but could also serve as a model testbed for adaptive experimentation algorithms in other domains.

HCSep 29, 2023
ABScribe: Rapid Exploration & Organization of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models

Mohi Reza, Nathan Laundry, Ilya Musabirov et al.

Exploring alternative ideas by rewriting text is integral to the writing process. State-of-the-art Large Language Models (LLMs) can simplify writing variation generation. However, current interfaces pose challenges for simultaneous consideration of multiple variations: creating new variations without overwriting text can be difficult, and pasting them sequentially can clutter documents, increasing workload and disrupting writers' flow. To tackle this, we present ABScribe, an interface that supports rapid, yet visually structured, exploration and organization of writing variations in human-AI co-writing tasks. With ABScribe, users can swiftly modify variations using LLM prompts, which are auto-converted into reusable buttons. Variations are stored adjacently within text fields for rapid in-place comparisons using mouse-over interactions on a popup toolbar. Our user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs.

LGAug 10, 2022
Increasing Students' Engagement to Reminder Emails Through Multi-Armed Bandits

Fernando J. Yanez, Angela Zavaleta-Bernuy, Ziwen Han et al.

Conducting randomized experiments in education settings raises the question of how we can use machine learning techniques to improve educational interventions. Using Multi-Armed Bandits (MAB) algorithms like Thompson Sampling (TS) in adaptive experiments can increase students' chances of obtaining better outcomes by increasing the probability of assignment to the most optimal condition (arm), even before an intervention completes. This is an advantage over traditional A/B testing, which may allocate an equal number of students to both optimal and non-optimal conditions. The problem is the exploration-exploitation trade-off. Even though adaptive policies aim to collect enough information to allocate more students to better arms reliably, past work shows that this may not be enough exploration to draw reliable conclusions about whether arms differ. Hence, it is of interest to provide additional uniform random (UR) exploration throughout the experiment. This paper shows a real-world adaptive experiment on how students engage with instructors' weekly email reminders to build their time management habits. Our metric of interest is open email rates which tracks the arms represented by different subject lines. These are delivered following different allocation algorithms: UR, TS, and what we identified as TS† - which combines both TS and UR rewards to update its priors. We highlight problems with these adaptive algorithms - such as possible exploitation of an arm when there is no significant difference - and address their causes and consequences. Future directions includes studying situations where the early choice of the optimal arm is not ideal and how adaptive algorithms can address them.

HCOct 18, 2023
Opportunities for Adaptive Experiments to Enable Continuous Improvement in Computer Science Education

Ilya Musabirov, Angela Zavaleta-Bernuy, Pan Chen et al.

Randomized A/B comparisons of alternative pedagogical strategies or other course improvements could provide useful empirical evidence for instructor decision-making. However, traditional experiments do not provide a straightforward pathway to rapidly utilize data, increasing the chances that students in an experiment experience the best conditions. Drawing inspiration from the use of machine learning and experimentation in product development at leading technology companies, we explore how adaptive experimentation might aid continuous course improvement. In adaptive experiments, data is analyzed and utilized as different conditions are deployed to students. This can be achieved using machine learning algorithms to identify which actions are more beneficial in improving students' learning experiences and outcomes. These algorithms can then dynamically deploy the most effective conditions in subsequent interactions with students, resulting in better support for students' needs. We illustrate this approach with a case study that provides a side-by-side comparison of traditional and adaptive experiments on adding self-explanation prompts in online homework problems in a CS1 course. This work paves the way for exploring the importance of adaptive experiments in bridging research and practice to achieve continuous improvement in educational settings.

HCApr 16, 2025
Co-Writing with AI, on Human Terms: Aligning Research with User Demands Across the Writing Process

Mohi Reza, Jeb Thomas-Mitchell, Peter Dushniku et al.

As generative AI tools like ChatGPT become integral to everyday writing, critical questions arise about how to preserve writers' sense of agency and ownership when using these tools. Yet, a systematic understanding of how AI assistance affects different aspects of the writing process - and how this shapes writers' agency - remains underexplored. To address this gap, we conducted a systematic review of 109 HCI papers using the PRISMA approach. From this literature, we identify four overarching design strategies for AI writing support: structured guidance, guided exploration, active co-writing, and critical feedback - mapped across the four key cognitive processes in writing: planning, translating, reviewing, and monitoring. We complement this analysis with interviews of 15 writers across diverse domains. Our findings reveal that writers' desired levels of AI intervention vary across the writing process: content-focused writers (e.g., academics) prioritize ownership during planning, while form-focused writers (e.g., creatives) value control over translating and reviewing. Writers' preferences are also shaped by contextual goals, values, and notions of originality and authorship. By examining when ownership matters, what writers want to own, and how AI interactions shape agency, we surface both alignment and gaps between research and user needs. Our findings offer actionable design guidance for developing human-centered writing tools for co-writing with AI, on human terms.

HCMar 6
Structured Exploration vs. Generative Flexibility: A Field Study Comparing Bandit and LLM Architectures for Personalised Health Behaviour Interventions

Dominik P. Hofer, Haochen Song, Rania Islambouli et al.

Behaviour Change Techniques (BCTs) are central to digital health interventions, yet selecting and delivering effective techniques remains challenging. Contextual bandits enable statistically grounded optimisation of BCT selection, while Large Language Models (LLMs) offer flexible, context-sensitive message generation. We conducted a 4-week study on physical activity motivation (N=54; 9 post-study interviews) that compared five daily messaging approaches: random templates, contextual bandit with templates, LLM generation, hybrid bandit+LLM, and LLM with interaction history. LLM-based approaches were rated substantially more helpful than templates, but no significant differences emerged among LLM conditions. Unexpectedly, bandit optimisation for BCTs selection yielded no additional perceived helpfulness compared with LLM-only approaches. Unconstrained LLMs focused heavily on a single BCT, whereas bandit systems enforced systematic exploration-exploitation across techniques. Quantitative and qualitative findings suggest contextual acknowledgement of user input drove perceived helpfulness. We contribute design suggestions for reflective AI health behaviour change systems that address a trade-off between structured exploration and generative autonomy.

LGJun 8, 2025
Investigating the Relationship Between Physical Activity and Tailored Behavior Change Messaging: Connecting Contextual Bandit with Large Language Models

Haochen Song, Dominik Hofer, Rania Islambouli et al.

Machine learning approaches, such as contextual multi-armed bandit (cMAB) algorithms, offer a promising strategy to reduce sedentary behavior by delivering personalized interventions to encourage physical activity. However, cMAB algorithms typically require large participant samples to learn effectively and may overlook key psychological factors that are not explicitly encoded in the model. In this study, we propose a hybrid approach that combines cMAB for selecting intervention types with large language models (LLMs) to personalize message content. We evaluate four intervention types: behavioral self-monitoring, gain-framed, loss-framed, and social comparison, each delivered as a motivational message aimed at increasing motivation for physical activity and daily step count. Message content is further personalized using dynamic contextual factors including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over a seven-day trial, participants receive daily messages assigned by one of four models: cMAB alone, LLM alone, combined cMAB with LLM personalization (cMABxLLM), or equal randomization (RCT). Outcomes include daily step count and message acceptance, assessed via ecological momentary assessments (EMAs). We apply a causal inference framework to evaluate the effects of each model. Our findings offer new insights into the complementary roles of LLM-based personalization and cMAB adaptation in promoting physical activity through personalized behavioral messaging.

LGJan 7, 2025
Adaptive Experiments Under Data Sparse Settings: Applications for Educational Platforms

Haochen Song, Ilya Musabirov, Ananya Bhattacharjee et al.

Adaptive experimentation is increasingly used in educational platforms to personalize learning through dynamic content and feedback. However, standard adaptive strategies such as Thompson Sampling often underperform in real-world educational settings where content variations are numerous and student participation is limited, resulting in sparse data. In particular, Thompson Sampling can lead to imbalanced content allocation and delayed convergence on which aspects of content are most effective for student learning. To address these challenges, we introduce Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS), an algorithm that refines the sampling strategy to improve content-related decision-making in data-sparse environments. WAPTS is guided by the principle of lenient regret, allowing near-optimal allocations to accelerate learning while still exploring promising content. We evaluate WAPTS in a learnersourcing scenario where students rate peer-generated learning materials, and demonstrate that it enables earlier and more reliable identification of promising treatments.

HCDec 20, 2021
Understanding User Perspectives on Prompts for Brief Reflection on Troubling Emotions

Ananya Bhattacharjee, Pan Chen, Linjia Zhou et al.

We investigate users' perspectives on an online reflective question activity (RQA) that prompts people to externalize their underlying emotions on a troubling situation. Inspired by principles of cognitive behavioral therapy, our 15-minute activity encourages self-reflection without a human or automated conversational partner. A deployment of our RQA on Amazon Mechanical Turk suggests that people perceive several benefits from our RQA, including structured awareness of their thoughts and problem-solving around managing their emotions. Quantitative evidence from a randomized experiment suggests people find that our RQA makes them feel less worried by their selected situation and worth the minimal time investment. A further two-week technology probe deployment with 11 participants indicates that people see benefits to doing this activity repeatedly, although the activity may get monotonous over time. In summary, this work demonstrates the promise of online reflection activities that carefully leverage principles of psychology in their design.

LGMar 22, 2021
Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments

Joseph Jay Williams, Jacob Nogas, Nina Deliu et al.

Multi-armed bandit algorithms have been argued for decades as useful for adaptively randomized experiments. In such experiments, an algorithm varies which arms (e.g. alternative interventions to help students learn) are assigned to participants, with the goal of assigning higher-reward arms to as many participants as possible. We applied the bandit algorithm Thompson Sampling (TS) to run adaptive experiments in three university classes. Instructors saw great value in trying to rapidly use data to give their students in the experiments better arms (e.g. better explanations of a concept). Our deployment, however, illustrated a major barrier for scientists and practitioners to use such adaptive experiments: a lack of quantifiable insight into how much statistical analysis of specific real-world experiments is impacted (Pallmann et al, 2018; FDA, 2019), compared to traditional uniform random assignment. We therefore use our case study of the ubiquitous two-arm binary reward setting to empirically investigate the impact of using Thompson Sampling instead of uniform random assignment. In this setting, using common statistical hypothesis tests, we show that collecting data with TS can as much as double the False Positive Rate (FPR; incorrectly reporting differences when none exist) and the False Negative Rate (FNR; failing to report differences when they exist)...

LGJul 17, 2020
Sequential Explanations with Mental Model-Based Policies

Arnold YS Yeung, Shalmali Joshi, Joseph Jay Williams et al.

The act of explaining across two parties is a feedback loop, where one provides information on what needs to be explained and the other provides an explanation relevant to this information. We apply a reinforcement learning framework which emulates this format by providing explanations based on the explainee's current mental model. We conduct novel online human experiments where explanations generated by various explanation methods are selected and presented to participants, using policies which observe participants' mental models, in order to optimize an interpretability proxy. Our results suggest that mental model-based policies (anchored in our proposed state representation) may increase interpretability over multiple sequential explanations, when compared to a random selection baseline. This work provides insight into how to select explanations which increase relevant information for users, and into conducting human-grounded experimentation to understand interpretability.

HCOct 12, 2019
RiPPLE: A Crowdsourced Adaptive Platform for Recommendation of Learning Activities

Hassan Khosravi, Kirsty Kitto, Joseph Jay Williams

This paper presents a platform called RiPPLE (Recommendation in Personalised Peer-Learning Environments) that recommends personalized learning activities to students based on their knowledge state from a pool of crowdsourced learning activities that are generated by educators and the students themselves. RiPPLE integrates insights from crowdsourcing, learning sciences, and adaptive learning, aiming to narrow the gap between these large bodies of research while providing a practical platform-based implementation that instructors can easily use in their courses. This paper provides a design overview of RiPPLE, which can be employed as a standalone tool or embedded into any learning management system (LMS) or online platform that supports the Learning Tools Interoperability (LTI) standard. The platform has been evaluated based on a pilot in an introductory course with 453 students at The University of Queensland. Initial results suggest that the use of the \name platform led to measurable learning gains and that students perceived the platform as beneficially supporting their learning.

AIApr 14, 2018
Combining Difficulty Ranking with Multi-Armed Bandits to Sequence Educational Content

Avi Segal, Yossi Ben David, Joseph Jay Williams et al.

As e-learning systems become more prevalent, there is a growing need for them to accommodate individual differences between students. This paper addresses the problem of how to personalize educational content to students in order to maximize their learning gains over time. We present a new computational approach to this problem called MAPLE (Multi-Armed Bandits based Personalization for Learning Environments) that combines difficulty ranking with multi-armed bandits. Given a set of target questions MAPLE estimates the expected learning gains for each question and uses an exploration-exploitation strategy to choose the next question to pose to the student. It maintains a personalized ranking over the difficulties of question in the target set which is used in two ways: First, to obtain initial estimates over the learning gains for the set of questions. Second, to update the estimates over time based on the students responses. We show in simulations that MAPLE was able to improve students' learning gains compared to approaches that sequence questions in increasing level of difficulty, or rely on content experts. When implemented in a live e-learning system in the wild, MAPLE showed promising results. This work demonstrates the efficacy of using stochastic approaches to the sequencing problem when augmented with information about question difficulty.

HCSep 15, 2015
A Methodology for Discovering how to Adaptively Personalize to Users using Experimental Comparisons

Joseph Jay Williams, Neil Heffernan

We explain and provide examples of a formalism that supports the methodology of discovering how to adapt and personalize technology by combining randomized experiments with variables associated with user models. We characterize a formal relationship between the use of technology to conduct A/B experiments and use of technology for adaptive personalization. The MOOClet Formalism [11] captures the equivalence between experimentation and personalization in its conceptualization of modular components of a technology. This motivates a unified software design pattern that enables technology components that can be compared in an experiment to also be adapted based on contextual data, or personalized based on user characteristics. With the aid of a concrete use case, we illustrate the potential of the MOOClet formalism for a methodology that uses randomized experiments of alternative micro-designs to discover how to adapt technology based on user characteristics, and then dynamically implements these personalized improvements in real time.

CYFeb 14, 2015
Supporting Instructors in Collaborating with Researchers using MOOClets

Joseph Jay Williams, Juho Kim, Brian C. Keegan

Most education and workplace learning takes place in classroom contexts far removed from laboratories or field sites with special arrangements for scientific research. But digital online resources provide a novel opportunity for large scale efforts to bridge the real world and laboratory settings which support data collection and randomized A/B experiments comparing different versions of content or interactions [2]. However, there are substantial technological and practical barriers in aligning instructors and researchers to use learning technologies like blended lessons/exercises & MOOCs as both a service for students and a realistic context to conduct research. This paper explains how the concept of a MOOClet can facilitate research-practitioner collaborations. MOOClets [3] are defined as modular components of a digital resource that can be implemented in technology to: (1) allow modification to create multiple versions, (2) allow experimental comparison and personalization of different versions, (3) reliably specify what data are collected. We suggest a framework in which instructors specify what kinds of changes to lessons, exercises, and emails they would be willing to adopt, and what data they will collect and make available. Researchers can then: (1) specify or design experiments that compare the effects of different versions on quantifiable outcomes. (2) Explore algorithms for maximizing particular outcomes by choosing alternative versions of a MOOClet based on the input variables available. We present a prototype survey tool for instructors intended to facilitate practitioner researcher matches and successful collaborations.

HCFeb 14, 2015
Using and Designing Platforms for In Vivo Education Experiments

Joseph Jay Williams, Korinn Ostrow, Xiaolu Xiong et al.

In contrast to typical laboratory experiments, the everyday use of online educational resources by large populations and the prevalence of software infrastructure for A/B testing leads us to consider how platforms can embed in vivo experiments that do not merely support research, but ensure practical improvements to their educational components. Examples are presented of randomized experimental comparisons conducted by subsets of the authors in three widely used online educational platforms Khan Academy, edX, and ASSISTments. We suggest design principles for platform technology to support randomized experiments that lead to practical improvements enabling Iterative Improvement and Collaborative Work and explain the benefit of their implementation by WPI co-authors in the ASSISTments platform.