HCMay 12
Does Algorithmic Uncertainty Sway Human Experts? Evidence from a Field Experiment in Selective College AdmissionsHansol Lee, AJ Alvero, René F. Kizilcec et al.
Algorithmic predictions are inherently uncertain: even models with similar aggregate accuracy can produce different predictions for the same individual, raising concerns that high-stakes decisions may become sensitive to arbitrary modeling choices. In this paper, we define \emph{algorithmic sensitivity} as the extent to which arbitrary modeling choices propagate into human decisions: how much a decision outcome shifts when a more favorable versus less favorable algorithmic prediction is presented to the decision-maker for the same individual. We estimate this in a randomized field experiment ($n=19{,}545$) embedded in a selective U.S. college admissions cycle, in which admissions officers reviewed each application alongside an algorithmic score while we randomly varied whether the score came from one of two similarly accurate prediction models. Although the two models performed similarly in aggregate, they frequently assigned different scores to the same applicant, creating exogenous variation in the score shown. Surprisingly, we find little evidence of algorithmic sensitivity: presenting a more favorable score does not meaningfully increase an applicant's probability of admission on average, even when the models disagree substantially. These findings suggest that, in this expert, high-stakes setting, human decision-making is largely invariant to arbitrary variation in algorithmic predictions, underscoring the role of professional discretion and institutional context in mediating the downstream effects of algorithmic uncertainty.
AIMay 24
AI Cartography: Mapping the Latent Landscape of AI Benchmark EcosystemsMichael Hardy, Anka Reuel, Lijin Zhang et al.
While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance ($\approx9\%$) than architecture or deployment categories in this context; (4) a manifest-score "scaling law" slope has low reliability ($R_β=0.53$); by contrast, the latent general-factor size slope is highly stable across ecosystem controls ($R_g=0.97$). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.
CYNov 9, 2023
Is a Seat at the Table Enough? Engaging Teachers and Students in Dataset Specification for ML in EducationMei Tan, Hansol Lee, Dakuo Wang et al.
Despite the promises of ML in education, its adoption in the classroom has surfaced numerous issues regarding fairness, accountability, and transparency, as well as concerns about data privacy and student consent. A root cause of these issues is the lack of understanding of the complex dynamics of education, including teacher-student interactions, collaborative learning, and classroom environment. To overcome these challenges and fully utilize the potential of ML in education, software practitioners need to work closely with educators and students to fully understand the context of the data (the backbone of ML applications) and collaboratively define the ML data specifications. To gain a deeper understanding of such a collaborative process, we conduct ten co-design sessions with ML software practitioners, educators, and students. In the sessions, teachers and students work with ML engineers, UX designers, and legal practitioners to define dataset characteristics for a given ML application. We find that stakeholders contextualize data based on their domain and procedural knowledge, proactively design data requirements to mitigate downstream harms and data reliability concerns, and exhibit role-based collaborative strategies and contribution patterns. Further, we find that beyond a seat at the table, meaningful stakeholder participation in ML requires structured supports: defined processes for continuous iteration and co-evaluation, shared contextual data quality standards, and information scaffolds for both technical and non-technical stakeholders to traverse expertise boundaries.
CVJun 8, 2023
IFaceUV: Intuitive Motion Facial Image Generation by Identity Preservation via UV mapHansol Lee, Yunhoe Ku, Eunseo Kim et al.
Reenacting facial images is an important task that can find numerous applications. We proposed IFaceUV, a fully differentiable pipeline that properly combines 2D and 3D information to conduct the facial reenactment task. The three-dimensional morphable face models (3DMMs) and corresponding UV maps are utilized to intuitively control facial motions and textures, respectively. Two-dimensional techniques based on 2D image warping is further required to compensate for missing components of the 3DMMs such as backgrounds, ear, hair and etc. In our pipeline, we first extract 3DMM parameters and corresponding UV maps from source and target images. Then, initial UV maps are refined by the UV map refinement network and it is rendered to the image with the motion manipulated 3DMM parameters. In parallel, we warp the source image according to the 2D flow field obtained from the 2D warping network. Rendered and warped images are combined in the final editing network to generate the final reenactment image. Additionally, we tested our model for the audio-driven facial reenactment task. Extensive qualitative and quantitative experiments illustrate the remarkable performance of our method compared to other state-of-the-art methods.
CVNov 11, 2022
HOReeNet: 3D-aware Hand-Object Grasping ReenactmentChanghwa Lee, Junuk Cha, Hansol Lee et al.
We present HOReeNet, which tackles the novel task of manipulating images involving hands, objects, and their interactions. Especially, we are interested in transferring objects of source images to target images and manipulating 3D hand postures to tightly grasp the transferred objects. Furthermore, the manipulation needs to be reflected in the 2D image space. In our reenactment scenario involving hand-object interactions, 3D reconstruction becomes essential as 3D contact reasoning between hands and objects is required to achieve a tight grasp. At the same time, to obtain high-quality 2D images from 3D space, well-designed 3D-to-2D projection and image refinement are required. Our HOReeNet is the first fully differentiable framework proposed for such a task. On hand-object interaction datasets, we compared our HOReeNet to the conventional image translation algorithms and reenactment algorithm. We demonstrated that our approach could achieved the state-of-the-art on the proposed task.
IVNov 27, 2025Code
GACELLE: GPU-accelerated tools for model parameter estimation and image reconstructionKwok-Shing Chan, Hansol Lee, Yixin Ma et al.
Quantitative MRI (qMRI) offers tissue-specific biomarkers that can be tracked over time or compared across populations; however, its adoption in clinical research is hindered by significant computational demands of parameter estimation. Images acquired at high spatial resolution or requiring fitting multiple parameters often require lengthy processing time, constraining their use in routine pipelines and slowing methodological innovation and clinical translation. We present GACELLE, an open source, GPU-accelerated framework for high-throughput qMRI analysis. GACELLE provides a stochastic gradient descent optimiser and a stochastic sampler in MATLAB, enabling fast parameter mapping, improved estimation robustness via spatial regularisation, and uncertainty quantification. GACELLE prioritises accessibility: users only need to provide a forward signal model, while GACELLE's backend manages computational parallelisation, automatic parameter updates, and memory-batching. The stochastic solver performs fully vectorised Markov chain Monte Carlo with identical likelihood on CPU and GPU, ensuring reproducibility across hardware. Benchmarking demonstrates up to 451-fold acceleration for the stochastic gradient descent solver and 14,380-fold acceleration for stochastic sampling compared to CPU-based estimation, without compromising accuracy. We demonstrated GACELLE's versatility on three representative qMRI models and on an image reconstruction task. Across these applications, GACELLE improves parameter precision, enhances test-retest reproducibility, and reduces noise in quantitative maps. By combining speed, usability and flexibility, GACELLE provides a generalisable optimisation framework for medical image analysis. It lowers the computational barrier for qMRI, paving the way for reproducible biomarker development, large-scale imaging studies, and clinical translation.
HCApr 14, 2021Code
Look at Me When I Talk to You: A Video Dataset to Enable Voice Assistants to Recognize ErrorsAndrea Cuadra, Hansol Lee, Jason Cho et al.
People interacting with voice assistants are often frustrated by voice assistants' frequent errors and inability to respond to backchannel cues. We introduce an open-source video dataset of 21 participants' interactions with a voice assistant, and explore the possibility of using this dataset to enable automatic error recognition to inform self-repair. The dataset includes clipped and labeled videos of participants' faces during free-form interactions with the voice assistant from the smart speaker's perspective. To validate our dataset, we emulated a machine learning classifier by asking crowdsourced workers to recognize voice assistant errors from watching soundless video clips of participants' reactions. We found trends suggesting it is possible to determine the voice assistant's performance from a participant's facial reaction alone. This work posits elicited datasets of interactive responses as a key step towards improving error recognition for repair for voice assistants in a wide variety of applications.
CYMay 4
A Large-Scale Observational Study on Obtaining Lightweight, Randomized Weekly Student FeedbackYunsung Kim, Hansol Lee, Candace Thille et al.
Conventional methods of obtaining student feedback on course experience face a fundamental tradeoff between feedback frequency and quality: as feedback requests become more frequent, participation often declines, and responses become less thoughtful over time. To obtain both timely and thoughtful feedback from students, Kim and Piech (Learning at Scale, 2023) recently proposed a simple, lightweight course feedback mechanism: surveying each student a small number of times per term during randomly selected weeks. Named High-Resolution Course Feedback (HRCF), this method has been shown to elicit feedback that instructors find helpful without imposing excessive burden on students. An important question, however, remains unanswered: is the use of this simple method associated with measurable improvements in students' actual course experiences? We study HRCF use across 103 course offerings, totaling 24,216 student enrollments, over four years from Fall 2021 through Fall 2025, spanning 42 unique computer science courses at an R1 institution. Through a regression analysis of four end-of-term student evaluation items for these courses, we find that first-time use of HRCF is not associated with a measurable change in average student ratings. However, among small- and medium-enrollment (<250 students) course offerings, continued HRCF use is associated with average rating increases of 0.045 to 0.048 points per additional term of use for learning-related items. We observe no statistically significant associations for large-enrollment (250 or more students) course offerings, nor for items measuring instructional quality and course organization. Together, these findings suggest that sustained HRCF use may support improvements in students' learning experiences, but that further design enhancements may be needed to produce measurable improvements in instructional quality and course organization.
CYMay 3
The "Astonishing Regularity'' Revisited: Sensitivity of Learning-Rate Estimates to Practice-Sequence LengthHansol Lee, Guilherme Lichand, Cristina Barnard et al.
A 2023 \textit{PNAS} study by Koedinger et al. (2023) fit the individual Additive Factors Model (iAFM) to 27 educational datasets and reported an ``astonishing regularity'' in student learning rates: students vary substantially in initial knowledge but learn at remarkably similar rates with practice. We probe a largely unexamined assumption underlying this finding -- that observation length in student log data is ignorable for mixed-effects estimation -- by refitting the iAFM on 26 of the original datasets while systematically truncating practice sequences at various depths, holding the set of students and knowledge components constant. Capping at the first ten opportunities per student-skill pair inflates the median estimated IQR of student learning rates by 75\%; capping at five inflates it by 205\%, with individual datasets ranging from negligible to 17-fold. The magnitude of this sensitivity diverges from what standard estimation theory predicts under ignorable truncation, and the dataset-specific heterogeneity is substantial. Three candidate mechanisms from adjacent literatures could account for the pattern -- informative observation length, functional-form misspecification, and identification weakness from sparse per-pair data -- but observational analysis on these data alone cannot adjudicate among them. We argue that practice sequence length distributions are an unexamined property of mixed-effects estimation on observational learning data, deserving explicit reporting before conclusions about learning-rate heterogeneity are drawn.
CVJan 12, 2024
3D Reconstruction of Interacting Multi-Person in Clothing from a Single ImageJunuk Cha, Hansol Lee, Jaewon Kim et al.
This paper introduces a novel pipeline to reconstruct the geometry of interacting multi-person in clothing on a globally coherent scene space from a single image. The main challenge arises from the occlusion: a part of a human body is not visible from a single view due to the occlusion by others or the self, which introduces missing geometry and physical implausibility (e.g., penetration). We overcome this challenge by utilizing two human priors for complete 3D geometry and surface contacts. For the geometry prior, an encoder learns to regress the image of a person with missing body parts to the latent vectors; a decoder decodes these vectors to produce 3D features of the associated geometry; and an implicit network combines these features with a surface normal map to reconstruct a complete and detailed 3D humans. For the contact prior, we develop an image-space contact detector that outputs a probability distribution of surface contacts between people in 3D. We use these priors to globally refine the body poses, enabling the penetration-free and accurate reconstruction of interacting multi-person in clothing on the scene space. The results demonstrate that our method is complete, globally coherent, and physically plausible compared to existing methods.
CVDec 28, 2023
Dynamic Appearance Modeling of Clothed 3D Human Avatars using a Single CameraHansol Lee, Junuk Cha, Yunhoe Ku et al.
The appearance of a human in clothing is driven not only by the pose but also by its temporal context, i.e., motion. However, such context has been largely neglected by existing monocular human modeling methods whose neural networks often struggle to learn a video of a person with large dynamics due to the motion ambiguity, i.e., there exist numerous geometric configurations of clothes that are dependent on the context of motion even for the same pose. In this paper, we introduce a method for high-quality modeling of clothed 3D human avatars using a video of a person with dynamic movements. The main challenge comes from the lack of 3D ground truth data of geometry and its temporal correspondences. We address this challenge by introducing a novel compositional human modeling framework that takes advantage of both explicit and implicit human modeling. For explicit modeling, a neural network learns to generate point-wise shape residuals and appearance features of a 3D body model by comparing its 2D rendering results and the original images. This explicit model allows for the reconstruction of discriminative 3D motion features from UV space by encoding their temporal correspondences. For implicit modeling, an implicit network combines the appearance and 3D motion features to decode high-fidelity clothed 3D human avatars with motion-dependent geometry and texture. The experiments show that our method can generate a large variation of secondary motion in a physically plausible way.
CYJul 10, 2020
Algorithmic Fairness in EducationRené F. Kizilcec, Hansol Lee
Data-driven predictive models are increasingly used in education to support students, instructors, and administrators. However, there are concerns about the fairness of the predictions and uses of these algorithmic systems. In this introduction to algorithmic fairness in education, we draw parallels to prior literature on educational access, bias, and discrimination, and we examine core components of algorithmic systems (measurement, model learning, and action) to identify sources of bias and discrimination in the process of developing and deploying these systems. Statistical, similarity-based, and causal notions of fairness are reviewed and contrasted in the way they apply in educational contexts. Recommendations for policy makers and developers of educational technology offer guidance for how to promote algorithmic fairness in education.
CYJun 30, 2020
Evaluation of Fairness Trade-offs in Predicting Student SuccessHansol Lee, René F. Kizilcec
Predictive models for identifying at-risk students early can help teaching staff direct resources to better support them, but there is a growing concern about the fairness of algorithmic systems in education. Predictive models may inadvertently introduce bias in who receives support and thereby exacerbate existing inequities. We examine this issue by building a predictive model of student success based on university administrative records. We find that the model exhibits gender and racial bias in two out of three fairness measures considered. We then apply post-hoc adjustments to improve model fairness to highlight trade-offs between the three fairness measures.