IRMay 24, 2022Code
RecipeRec: A Heterogeneous Graph Learning Model for Recipe RecommendationYijun Tian, Chuxu Zhang, Zhichun Guo et al.
Recipe recommendation systems play an essential role in helping people decide what to eat. Existing recipe recommendation systems typically focused on content-based or collaborative filtering approaches, ignoring the higher-order collaborative signal such as relational structure information among users, recipes and food items. In this paper, we formalize the problem of recipe recommendation with graphs to incorporate the collaborative signal into recipe recommendation through graph modeling. In particular, we first present URI-Graph, a new and large-scale user-recipe-ingredient graph. We then propose RecipeRec, a novel heterogeneous graph learning model for recipe recommendation. The proposed model can capture recipe content and collaborative signal through a heterogeneous graph neural network with hierarchical attention and an ingredient set transformer. We also introduce a graph contrastive augmentation strategy to extract informative graph knowledge in a self-supervised manner. Finally, we design a joint objective function of recommendation and contrastive learning to optimize the model. Extensive experiments demonstrate that RecipeRec outperforms state-of-the-art methods for recipe recommendation. Dataset and codes are available at https://github.com/meettyj/RecipeRec.
LGMay 24, 2022Code
Recipe2Vec: Multi-modal Recipe Representation Learning with Graph Neural NetworksYijun Tian, Chuxu Zhang, Zhichun Guo et al.
Learning effective recipe representations is essential in food studies. Unlike what has been developed for image-based recipe retrieval or learning structural text embeddings, the combined effect of multi-modal information (i.e., recipe images, text, and relation data) receives less attention. In this paper, we formalize the problem of multi-modal recipe representation learning to integrate the visual, textual, and relational information into recipe embeddings. In particular, we first present Large-RG, a new recipe graph data with over half a million nodes, making it the largest recipe graph to date. We then propose Recipe2Vec, a novel graph neural network based recipe embedding model to capture multi-modal information. Additionally, we introduce an adversarial attack strategy to ensure stable learning and improve performance. Finally, we design a joint objective function of node classification and adversarial learning to optimize the model. Extensive experiments demonstrate that Recipe2Vec outperforms state-of-the-art baselines on two classic food study tasks, i.e., cuisine category classification and region prediction. Dataset and codes are available at https://github.com/meettyj/Recipe2Vec.
CLJan 28
Automated Benchmark Generation from Domain Guidelines Informed by Bloom's TaxonomySi Chen, Le Huy Khiem, Annalisa Szymanski et al.
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains: teaching, dietetics, and caregiving, we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.
AIMar 18
TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to InstructorsIsabel Molnar, Peiyu Li, Si Chen et al.
Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic chatbot advice or non-scalable teaching center human-human consultations. We present TeachingCoach, a pedagogically grounded chatbot designed to support instructor professional development through real-time, conversational guidance. TeachingCoach is built on a data-centric pipeline that extracts pedagogical rules from educational resources and uses synthetic dialogue generation to fine-tune a specialized language model that guides instructors through problem identification, diagnosis, and strategy development. Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than a GPT-4o mini baseline, while a user study with higher education instructors highlights trade-offs between conversational depth and interaction efficiency. Together, these results demonstrate that pedagogically grounded, synthetic data driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.
CLOct 26, 2025Code
Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static BenchmarksPeiyu Li, Xiuxiu Tang, Si Chen et al.
Large language model evaluation requires thousands of benchmark items, making evaluations expensive and slow. Existing methods compute average accuracy across fixed item sets, treating all items equally despite varying quality and informativeness. We present ATLAS an adaptive testing framework using Item Response Theory (IRT) to estimate model ability through Fisher information-guided item selection. Our analysis of five major benchmarks reveals that 3-6% of items exhibit negative discrimination, indicating annotation errors that corrupt static evaluation. ATLAS achieves 90% item reduction while maintaining measurement precision: on HellaSwag (5,608 items), we match full-benchmark estimates using only 42 items with 0.154 MAE. Our framework maintains item exposure rates below 10% and test overlap at 16-27%, compared to static benchmarks where every model sees all items (100% exposure). Among 4,000+ tested models, IRT ranks differ from accuracy ranks: models with the same accuracy get different IRT scores, and 23-31% of all models shift by more than 10 rank positions. Code and calibrated item banks are available at https://github.com/Peiyu-Georgia-Li/ATLAS.git.
HCSep 15, 2025
Exploring Conversational Design Choices in LLMs for Pedagogical Purposes: Socratic and Narrative Approaches for Improving Instructor's Teaching PracticeSi Chen, Isabel R. Molnar, Peiyu Li et al.
Large language models (LLMs) typically generate direct answers, yet they are increasingly used as learning tools. Studying instructors' usage is critical, given their role in teaching and guiding AI adoption in education. We designed and evaluated TeaPT, an LLM for pedagogical purposes that supports instructors' professional development through two conversational approaches: a Socratic approach that uses guided questioning to foster reflection, and a Narrative approach that offers elaborated suggestions to extend externalized cognition. In a mixed-method study with 41 higher-education instructors, the Socratic version elicited greater engagement, while the Narrative version was preferred for actionable guidance. Subgroup analyses further revealed that less-experienced, AI-optimistic instructors favored the Socratic version, whereas more-experienced, AI-cautious instructors preferred the Narrative version. We contribute design implications for LLMs for pedagogical purposes, showing how adaptive conversational approaches can support instructors with varied profiles while highlighting how AI attitudes and experience shape interaction and learning.
AIAug 6, 2025
\textsc{SimInstruct}: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated NovicesSi Chen, Izzy Molnar, Ting Hua et al.
High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding -- the process by which an expert supports a novice's thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM's persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o's limitations in weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.
SEMar 21, 2018
How Do Practitioners Perceive Assurance Cases in Safety-Critical Software Systems?Jinghui Cheng, Micayla Goodrum, Ronald Metoyer et al.
Safety-critical software systems are those whose failure or malfunction could result in casualty and/or serious financial loss. In such systems, safety assurance cases (SACs) are an emerging approach that adopts a proactive strategy to produce structuralized safety justifications and arguments. While SACs are recommended in many software-intensive safety-critical domains, the lack of knowledge regarding the practitioners' perspectives on using SACs hinders effective adoption of this approach. To gain such knowledge, we interviewed nine practitioners and safety experts who focused on safety-critical software systems. In general, our participants found the SAC approach beneficial for communication of safety arguments and management of safety issues in a multidisciplinary setting. The challenges they faced when using SACs were primarily associated with (1) a lack of tool support, (2) insufficient process integration, and (3) scarcity of experienced personnel. To overcome those challenges, our participants suggested tactics that focused on creating direct safety arguments. Process and organizational adjustments are also needed to streamline SAC analysis and creation. Finally, our participants emphasized the importance of knowledge sharing about SACs across software-intensive safety-critical domains.
HCDec 6, 2017
Coupling Story to Visualization: Using Textual Analysis as a Bridge Between Data and InterpretationRonald Metoyer, Qiyu Zhi, Bart Janczuk et al.
Online writers and journalism media are increasingly combining visualization (and other multimedia content) with narrative text to create narrative visualizations. Often, however, the two elements are presented independently of one another. We propose an approach to automatically integrate text and visualization elements. We begin with a writer's narrative that presumably can be supported with visual data evidence. We leverage natural language processing, quantitative narrative analysis, and information visualization to (1) automatically extract narrative components (who, what, when, where) from data-rich stories, and (2) integrate the supporting data evidence with the text to develop a narrative visualization. We also employ bidirectional interaction from text to visualization and visualization to text to support reader exploration in both directions. We demonstrate the approach with a case study in the data-rich field of sports journalism.
HCAug 19, 2017
Design Space of Programming Tools on Mobile Touchscreen DevicesPoorna Talkad Sukumar, Ronald Metoyer
While mobile touchscreen devices are ubiquitous and present opportunities for novel applications, they have seen little adoption as tools for computer programming. In this literature survey, we bring together the diverse research work on programming-related tasks supported by mobile touchscreen devices to explore the design space for applying them to programming situations. We used the Grounded theory approach to identify themes and classify previous work. We present these themes and how each paper contributes to the theme, and we outline the remaining challenges in and opportunities for using mobile touchscreen devices in programming applications.
HCMay 31, 2017
Recognizing Handwritten Source CodeQiyu Zhi, Ronald Metoyer
Supporting programming on touchscreen devices requires effective text input and editing methods. Unfortunately, the virtual keyboard can be inefficient and uses valuable screen space on already small devices. Recent advances in stylus input make handwriting a potentially viable text input solution for programming on touchscreen devices. The primary barrier, however, is that handwriting recognition systems are built to take advantage of the rules of natural language, not those of a programming language. In this paper, we explore this particular problem of handwriting recognition for source code. We collect and make publicly available a dataset of handwritten Python code samples from 15 participants and we characterize the typical recognition errors for this handwritten Python source code when using a state-of-the-art handwriting recognition tool. We present an approach to improve the recognition accuracy by augmenting a handwriting recognizer with the programming language grammar rules. Our experiment on the collected dataset shows an 8.6% word error rate and a 3.6% character error rate which outperforms standard handwriting recognition systems and compares favorably to typing source code on virtual keyboards.
SEMar 28, 2017
Documenting API Input/Output ExamplesSiyuan Jiang, Ameer Armaly, Collin McMillan et al.
When learning to use an Application Programming Interface (API), programmers need to understand the inputs and outputs (I/O) of the API functions. Current documentation tools automatically document the static information of I/O, such as parameter types and names. What is missing from these tools is dynamic information, such as I/O examples---actual valid values of inputs that produce certain outputs. In this paper, we demonstrate a prototype toolset we built to generate I/O examples. Our tool logs I/O values when API functions are executed, for example in running test suites. Then, the tool puts I/O values into API documents as I/O examples. Our tool has three programs: 1) funcWatch, which collects I/O values when API developers run test suites, 2) ioSelect, which selects one I/O example from a set of I/O values, and 3) ioPresent, which embeds the I/O examples into documents. In a preliminary evaluation, we used our tool to generate four hundred I/O examples for three C libraries: ffmpeg, libssh, and protobuf-c.
LGMar 28, 2017
Fast Optimization of Wildfire Suppression Policies with SMACSean McGregor, Rachel Houtman, Claire Montgomery et al.
Managers of US National Forests must decide what policy to apply for dealing with lightning-caused wildfires. Conflicts among stakeholders (e.g., timber companies, home owners, and wildlife biologists) have often led to spirited political debates and even violent eco-terrorism. One way to transform these conflicts into multi-stakeholder negotiations is to provide a high-fidelity simulation environment in which stakeholders can explore the space of alternative policies and understand the tradeoffs therein. Such an environment needs to support fast optimization of MDP policies so that users can adjust reward functions and analyze the resulting optimal policies. This paper assesses the suitability of SMAC---a black-box empirical function optimization algorithm---for rapid optimization of MDP policies. The paper describes five reward function components and four stakeholder constituencies. It then introduces a parameterized class of policies that can be easily understood by the stakeholders. SMAC is applied to find the optimal policy in this class for the reward functions of each of the stakeholder constituencies. The results confirm that SMAC is able to rapidly find good policies that make sense from the domain perspective. Because the full-fidelity forest fire simulator is far too expensive to support interactive optimization, SMAC is applied to a surrogate model constructed from a modest number of runs of the full-fidelity simulator. To check the quality of the SMAC-optimized policies, the policies are evaluated on the full-fidelity simulator. The results confirm that the surrogate values estimates are valid. This is the first successful optimization of wildfire management policies using a full-fidelity simulation. The same methodology should be applicable to other contentious natural resource management problems where high-fidelity simulation is extremely expensive.
LGMar 28, 2017
Factoring Exogenous State for Model-Free Monte CarloSean McGregor, Rachel Houtman, Claire Montgomery et al.
Policy analysts wish to visualize a range of policies for large simulator-defined Markov Decision Processes (MDPs). One visualization approach is to invoke the simulator to generate on-policy trajectories and then visualize those trajectories. When the simulator is expensive, this is not practical, and some method is required for generating trajectories for new policies without invoking the simulator. The method of Model-Free Monte Carlo (MFMC) can do this by stitching together state transitions for a new policy based on previously-sampled trajectories from other policies. This "off-policy Monte Carlo simulation" method works well when the state space has low dimension but fails as the dimension grows. This paper describes a method for factoring out some of the state and action variables so that MFMC can work in high-dimensional MDPs. The new method, MFMCi, is evaluated on a very challenging wildfire management MDP.