CY · Jul 8, 2022
Computationally Identifying Funneling and Focusing Questions in Classroom Discourse
Sterling Alic, Dorottya Demszky, Zid Mancenido et al. · stanford
Responsive teaching is a highly effective strategy that promotes student learning. In math classrooms, teachers might "funnel" students towards a normative answer or "focus" students to reflect on their own thinking, deepening their understanding of math concepts. When teachers focus, they treat students' contributions as resources for collective sensemaking, and thereby significantly improve students' achievement and confidence in mathematics. We propose the task of computationally detecting funneling and focusing questions in classroom discourse. We do so by creating and releasing an annotated dataset of 2,348 teacher utterances labeled for funneling and focusing questions, or neither. We introduce supervised and unsupervised approaches to differentiating these questions. Our best model, a supervised RoBERTa model fine-tuned on our dataset, has a strong linear correlation of .76 with human expert labels and with positive educational outcomes, including math instruction quality and student achievement, showing the model's potential for use in automated teacher feedback tools. Our unsupervised measures show significant but weaker correlations with human labels and outcomes, and they highlight interesting linguistic patterns of funneling and focusing questions. The high performance of the supervised measure indicates its promise for supporting teachers in their instruction.
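As a rough illustration of how a "linear correlation with human expert labels" can be computed, the sketch below implements the Pearson coefficient in plain Python. It assumes the reported .76 is a Pearson-style correlation, which the abstract does not state explicitly; the scores and labels are made-up placeholders, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: model scores vs. averaged expert labels for five utterances.
model_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
human_labels = [0.0, 0.5, 0.5, 1.0, 1.0]
r = pearson_r(model_scores, human_labels)
print(f"r = {r:.2f}")
```

In practice a library routine such as `scipy.stats.pearsonr` would also return a p-value, which matters when claiming a correlation is significant.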
CL · Nov 21, 2022 · Code
The NCTE Transcripts: A Dataset of Elementary Math Classroom Transcripts
Dorottya Demszky, Heather Hill
Classroom discourse is a core medium of instruction: analyzing it can provide a window into teaching and learning and drive the development of new tools for improving instruction. We introduce the largest dataset of mathematics classroom transcripts available to researchers, and demonstrate how this data can help improve instruction. The dataset consists of 1,660 4th and 5th grade elementary mathematics observations, each 45-60 minutes long, collected by the National Center for Teacher Effectiveness (NCTE) between 2010 and 2013. The anonymized transcripts represent data from 317 teachers across 4 school districts that serve largely historically marginalized students. The transcripts come with rich metadata, including turn-level annotations for dialogic discourse moves, classroom observation scores, demographic information, survey responses and student test scores. We demonstrate that our natural language processing model, trained on our turn-level annotations, can learn to identify dialogic discourse moves and that these moves are correlated with better classroom observation scores and learning outcomes. This dataset opens up several possibilities for researchers, educators and policymakers to learn about and improve K-12 instruction. The dataset can be found at https://github.com/ddemszky/classroom-transcript-analysis.
CL · Oct 16, 2023 · Code
Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes
Rose E. Wang, Qingyang Zhang, Carly Robinson et al.
Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert's latent thought process into a decision-making model for remediation. This involves an expert identifying (A) the student's error, (B) a remediation strategy, and (C) their intention before generating a response. We construct a dataset of 700 real tutoring conversations, annotated by experts with their decisions. We evaluate state-of-the-art LLMs on our dataset and find that the expert's decision-making model is critical for LLMs to close the gap: responses from GPT4 with expert decisions (e.g., "simplify the problem") are 76% more preferred than those without. Additionally, context-sensitive decisions are critical to closing pedagogical gaps: random decisions decrease GPT4's response quality by 97% compared to expert decisions. Our work shows the potential of embedding expert thought processes in LLM generations to enhance their capability to bridge novice-expert knowledge gaps. Our dataset and code can be found at: https://github.com/rosewang2008/bridge.
CL · Jun 5, 2023
Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction
Rose E. Wang, Dorottya Demszky
Coaching, which involves classroom observation and expert feedback, is a widespread and fundamental part of teacher training. However, the majority of teachers do not have access to consistent, high quality coaching due to limited resources and access to expertise. We explore whether generative AI could become a cost-effective complement to expert feedback by serving as an automated teacher coach. In doing so, we propose three teacher coaching tasks for generative AI: (A) scoring transcript segments based on classroom observation instruments, (B) identifying highlights and missed opportunities for good instructional strategies, and (C) providing actionable suggestions for eliciting more student reasoning. We recruit expert math teachers to evaluate the zero-shot performance of ChatGPT on each of these tasks for elementary math classroom transcripts. Our results reveal that ChatGPT generates responses that are relevant to improving instruction, but they are often not novel or insightful. For example, 82% of the model's suggestions point to places in the transcript where the teacher is already implementing that suggestion. Our work highlights the challenges of producing insightful, novel and truthful feedback for teachers while paving the way for future research to address these obstacles and improve the capacity of generative AI to coach teachers.
AS · Sep 12, 2023
Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults
Ahmed Adel Attia, Jing Liu, Wei Ai et al.
Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited testset. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST testset from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.
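For readers unfamiliar with the metric, the sketch below shows the standard definition of Word Error Rate (word-level edit distance divided by reference length) and the relative reduction implied by the figures above. The function names are illustrative and the whitespace tokenization is a simplification; the paper's own text normalization and scoring pipeline are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


# WER figures quoted in the abstract (percent).
small = relative_reduction(13.93, 9.11)
medium = relative_reduction(13.23, 8.61)
print(f"Whisper-Small: {small:.1%}, Whisper-Medium: {medium:.1%}")
```

Under this reading, both models see a relative WER reduction of roughly 35%, which is a larger effect than the absolute drop of under five points might suggest.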
CL · Jun 15, 2023
SIGHT: A Large Annotated Dataset on Student Insights Gathered from Higher Education Transcripts
Rose E. Wang, Pawan Wirawarn, Noah Goodman et al.
Lectures are a learning experience for both students and teachers. Students learn from teachers about the subject material, while teachers learn from students about how to refine their instruction. However, online student feedback is unstructured and abundant, making it challenging for teachers to learn and improve. We take a step towards tackling this challenge. First, we contribute a dataset for studying this problem: SIGHT is a large dataset of 288 math lecture transcripts and 15,784 comments collected from the Massachusetts Institute of Technology OpenCourseWare (MIT OCW) YouTube channel. Second, we develop a rubric for categorizing feedback types using qualitative analysis. Qualitative analysis methods are powerful in uncovering domain-specific insights; however, they are costly to apply to large data sources. To overcome this challenge, we propose a set of best practices for using large language models (LLMs) to cheaply classify the comments at scale. We observe a striking correlation between the model's and humans' annotation: Categories with consistent human annotations (>0.9 inter-rater reliability, IRR) also display higher human-model agreement (>0.7), while categories with less consistent human annotations (0.7-0.8 IRR) correspondingly demonstrate lower human-model agreement (0.3-0.5). These techniques uncover useful student feedback from thousands of comments, costing around $0.002 per comment. We conclude by discussing exciting future directions on using online student feedback and improving automated annotation techniques for qualitative research.
CY · Nov 2, 2023
Measuring Five Accountable Talk Moves to Improve Instruction at Scale
Ashlee Kupor, Candice Morgan, Dorottya Demszky
Providing consistent, individualized feedback to teachers on their instruction can improve student learning outcomes. Such feedback can especially benefit novice instructors who teach on online platforms and have limited access to instructional training. To build scalable measures of instruction, we fine-tune RoBERTa and GPT models to identify five instructional talk moves inspired by accountable talk theory: adding on, connecting, eliciting, probing and revoicing students' ideas. We fine-tune these models on a newly annotated dataset of 2500 instructor utterances derived from transcripts of small group instruction in an online computer science course, Code in Place. Although we find that GPT-3 consistently outperforms RoBERTa in terms of precision, its recall varies significantly. We correlate the instructors' use of each talk move with indicators of student engagement and satisfaction, including students' section attendance, section ratings, and assignment completion rates. We find that using talk moves generally correlates positively with student outcomes, and connecting student ideas has the largest positive impact. These results corroborate previous research on the effectiveness of accountable talk moves and provide exciting avenues for using these models to provide instructors with useful, scalable feedback.
CL · Oct 16, 2023
"Mistakes Help Us Grow": Facilitating and Evaluating Growth Mindset Supportive Language in Classrooms
Kunal Handa, Margaret Clapper, Jessica Boyle et al.
Teachers' growth mindset supportive language (GMSL)--rhetoric emphasizing that one's skills can be improved over time--has been shown to significantly reduce disparities in academic achievement and enhance students' learning outcomes. Although teachers espouse growth mindset principles, most find it difficult to adopt GMSL in their practice due to the lack of effective coaching in this area. We explore whether large language models (LLMs) can provide automated, personalized coaching to support teachers' use of GMSL. We establish an effective coaching tool to reframe unsupportive utterances to GMSL by developing (i) a parallel dataset containing GMSL-trained teacher reframings of unsupportive statements with an accompanying annotation guide, (ii) a GMSL prompt framework to revise teachers' unsupportive language, and (iii) an evaluation framework grounded in psychological theory for evaluating GMSL with the help of students and teachers. We conduct a large-scale evaluation involving 174 teachers and 1,006 students, finding that both teachers and students perceive GMSL-trained teacher and model reframings as more effective in fostering a growth mindset and promoting challenge-seeking behavior, among other benefits. We also find that model-generated reframings outperform those from the GMSL-trained teachers. These results show promise for harnessing LLMs to provide automated GMSL feedback for teachers and, more broadly, LLMs' potential for supporting students' learning in the classroom. Our findings also demonstrate the benefit of large-scale human evaluations when applying LLMs in educational domains.
71.2 · AI · Apr 2
Mitigating LLM biases toward spurious social contexts using direct preference optimization
Hyunji Nam, Dorottya Demszky
LLMs are increasingly used for high-stakes decision-making, yet their sensitivity to spurious contextual information can introduce harmful biases. This is a critical concern when models are deployed for tasks like evaluating teachers' instructional quality, where biased assessment can affect teachers' professional development and career trajectories. We investigate model robustness to spurious social contexts using the largest publicly available dataset of U.S. classroom transcripts (NCTE) paired with expert rubric scores. Evaluating seven frontier and open-weight models across seven categories of spurious contexts -- including teacher experience, education level, demographic identity, and sycophancy-inducing framings -- we find that irrelevant contextual information can shift model predictions by up to 1.48 points on a 7-point scale, with larger models sometimes exhibiting greater sensitivity despite higher predictive accuracy. Mitigations using prompts and standard direct preference optimization (DPO) prove largely insufficient. We propose Debiasing-DPO, a self-supervised training method that pairs neutral reasoning generated from the query alone with the model's biased reasoning generated with both the query and additional spurious context. We further combine this objective with supervised fine-tuning on ground-truth labels to prevent losses in predictive accuracy. Applied to Llama 3B & 8B and Qwen 3B & 7B Instruct models, Debiasing-DPO reduces bias by 84% and improves predictive accuracy by 52% on average. Our findings from the educational case study highlight that robustness to spurious context is not a natural byproduct of model scaling and that our proposed method can yield substantial gains in both accuracy and robustness for prompt-based prediction tasks.
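To make the pairing idea concrete, here is a minimal sketch of the standard DPO loss applied to one such preference pair. It is not the authors' implementation: the `PreferencePair` container, the example strings, the log-probability values, and the choice of `beta` are all hypothetical, and the abstract does not say which prompt the pair is conditioned on or how the supervised fine-tuning term is weighted.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container for one self-supervised training example."""
    prompt: str    # the query the model is asked to score
    chosen: str    # neutral reasoning, generated from the query alone
    rejected: str  # biased reasoning, generated with spurious context added


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = PreferencePair(
    prompt="Rate the instructional quality of this transcript segment.",
    chosen="The teacher asks students to explain their reasoning, so ...",
    rejected="Since this is a first-year teacher, the lesson is probably ...",
)

# Made-up log-probabilities: the policy already prefers the neutral reasoning
# slightly more than the frozen reference model does.
loss = dpo_loss(-4.0, -9.0, -5.0, -7.0, beta=0.1)
print(f"{loss:.4f}")
```

When the policy and reference assign identical log-probabilities, the loss is log 2; it falls below that as the policy shifts probability toward the neutral reasoning relative to the reference.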
87.2 · HC · Mar 23
Practitioner Voices Summit: How Teachers Evaluate AI Tools through Deliberative Sensemaking
Dorottya Demszky, Christopher Mah, Helen Higgins
Teachers face growing pressure to integrate AI tools into their classrooms, yet are rarely positioned as agentic decision-makers in this process. Understanding the criteria teachers use to evaluate AI tools, and the conditions that support such reasoning, is essential for responsible AI integration. We address this gap through a two-day national summit in which 61 U.S. K-12 mathematics educators developed personal rubrics for evaluating AI classroom tools. The summit was designed to support deliberative sensemaking, a process we conceptualize by integrating Technological Pedagogical Content Knowledge (TPACK) with deliberative agency. Teachers generated over 200 criteria - initial articulations spanning four higher-order themes (Practical, Equitable, Flexible, and Rigorous) - that addressed both AI outputs and the process of using AI. Criteria contained productive tensions (e.g., personalization versus fairness, adaptability versus efficiency), and the vast majority framed AI as an assistant rather than a coaching tool for professional learning. Analysis of surveys, interviews, and summit discussions revealed five mechanisms supporting deliberative sensemaking: time and space for deliberation, artifact-centered sensemaking, collaborative reflection through diverse viewpoints, knowledge-building, and psychological safety. Across these mechanisms, TPACK and agency operated in a mutually reinforcing cycle - knowledge-building enabled more grounded evaluative judgment, while the act of constructing criteria deepened teachers' understanding of tools. We discuss implications for edtech developers seeking practitioner input, school leaders making adoption decisions, educators and professional learning designers, and researchers working to elicit teachers' evaluative reasoning about rapidly evolving technologies.
CL · Sep 13, 2024
CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments
Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi et al.
Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones and classroom conditions.
AI · Nov 11, 2025
DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
Vishal Kumar, Shubhra Mishra, Rebecca Hao et al.
Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.
CL · Feb 7, 2024 · Code
Edu-ConvoKit: An Open-Source Library for Education Conversation Data
Rose E. Wang, Dorottya Demszky
We introduce Edu-ConvoKit, an open-source library designed to handle pre-processing, annotation and analysis of conversation data in education. Resources for analyzing education conversation data are scarce, making the research challenging to perform and therefore hard to access. We address these challenges with Edu-ConvoKit. Edu-ConvoKit is open-source (https://github.com/stanfordnlp/edu-convokit), pip-installable (https://pypi.org/project/edu-convokit/), with comprehensive documentation (https://edu-convokit.readthedocs.io/en/latest/). Our demo video is available at: https://youtu.be/zdcI839vAko?si=h9qlnl76ucSuXb8- . We include additional resources, such as Colab applications of Edu-ConvoKit to three diverse education datasets and a repository of Edu-ConvoKit related papers, that can be found in our GitHub repository.
IR · Mar 6, 2024 · Code
Backtracing: Retrieving the Cause of the Query
Rose E. Wang, Pawan Wirawarn, Omar Khattab et al.
Many online content portals allow users to ask questions to supplement their understanding (e.g., of lectures). While information retrieval (IR) systems may provide answers for such user queries, they do not directly assist content creators -- such as lecturers who want to improve their content -- identify segments that _caused_ a user to ask those questions. We introduce the task of backtracing, in which systems retrieve the text segment that most likely caused a user query. We formalize three real-world domains for which backtracing is important in improving content delivery and communication: understanding the cause of (a) student confusion in the Lecture domain, (b) reader curiosity in the News Article domain, and (c) user emotion in the Conversation domain. We evaluate the zero-shot performance of popular information retrieval methods and language modeling methods, including bi-encoder, re-ranking and likelihood-based methods and ChatGPT. While traditional IR systems retrieve semantically relevant information (e.g., details on "projection matrices" for a query "does projecting multiple times still lead to the same point?"), they often miss the causally relevant context (e.g., the lecturer states "projecting twice gets me the same answer as one projection"). Our results show that there is room for improvement on backtracing and it requires new retrieval approaches. We hope our benchmark serves to improve future retrieval systems for backtracing, spawning systems that refine content generation and identify linguistic triggers influencing user queries. Our code and data are open-sourced: https://github.com/rosewang2008/backtracing.
CL · Jul 7, 2025 · Code
EduCoder: An Open-Source Annotation System for Education Transcript Data
Guanzhong Pan, Mei Tan, Hyunji Nam et al.
We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts -- with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson's purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators' responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.
34.7 · AI · Apr 30
Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI
Dorottya Demszky, Edith Bouton, Alison Twiner et al.
Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping this methodological space along three dimensions--scale, duration, and modality--where a study's position shapes what it reveals and obscures. We illustrate it through contrasting studies of dialogic teaching--Howe et al. (2019) and Snell and Lefstein (2018)--and an interview with the lead researchers, organized around three questions: what can be operationalized, what mechanisms become visible, and what translates to practice. We then examine how AI is expanding this space and how the framework can guide research and tool design.
7.0 · CL · Apr 18
From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment
Ivo Bueno, Babette Bühler, Philipp Stark et al.
Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.
CL · May 15, 2024
Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings
Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi et al.
Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones, classroom conditions as well as classroom demographics. Our CPT models show improved ability to generalize to different demographics unseen in the labeled finetuning data.
46.3 · CL · Mar 12
Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback
Mei Tan, Lena Phalen, Dorottya Demszky
Effective personalized feedback is critical to students' literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how "personalization" shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes--even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias--overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.
CL · Oct 6, 2025
TeachLM: Post-Training LLMs for Education Using Authentic Learning Data
Janos Perczel, Jin Chow, Dorottya Demszky
The promise of generative AI to revolutionize education is constrained by the pedagogical limits of large language models (LLMs). A major issue is the lack of access to high-quality training data that reflect the learning of actual students. Prompt engineering has emerged as a stopgap, but the ability of prompts to encode complex pedagogical strategies in rule-based natural language is inherently limited. To address this gap, we introduce TeachLM, an LLM optimized for teaching through parameter-efficient fine-tuning of state-of-the-art models. TeachLM is trained on a dataset comprised of 100,000 hours of one-on-one, longitudinal student-tutor interactions maintained by Polygence, which underwent a rigorous anonymization process to protect privacy. We use parameter-efficient fine-tuning to develop an authentic student model that enables the generation of high-fidelity synthetic student-tutor dialogues. Building on this capability, we propose a novel multi-turn evaluation protocol that leverages synthetic dialogue generation to provide fast, scalable, and reproducible assessments of the dialogical capabilities of LLMs. Our evaluations demonstrate that fine-tuning on authentic learning data significantly improves conversational and pedagogical performance: doubling student talk time, improving questioning style, increasing dialogue turns by 50%, and yielding greater personalization of instruction.
CL · May 29, 2025
Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes
Li Lucy, Camilla Griffiths, Sarah Levine et al. · berkeley
Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.
CL · Sep 2, 2025
IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations
Hyunji Nam, Lucia Langlois, James Malamut et al.
Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists. In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlign, an intuitive benchmarking paradigm for capturing expert similarity ratings via a "pick-the-odd-one-out" triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlign, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlign significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlign as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
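One way to picture the "pick-the-odd-one-out" triplet task is sketched below: a similarity metric picks the most similar pair in each triplet, the remaining item is its odd one out, and the metric is scored by how often that matches the expert's pick. This is an assumed reading of the paradigm rather than the paper's code; the Jaccard word-overlap metric, the example annotations, and the expert choices are all hypothetical stand-ins.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity: word-set overlap (a stand-in for any metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def odd_one_out(triplet, similarity):
    """Index of the item excluded from the most similar pair in the triplet.

    Ties are broken in favor of the first pair in (0,1), (0,2), (1,2) order.
    """
    best_pair = max(
        combinations(range(3), 2),
        key=lambda pair: similarity(triplet[pair[0]], triplet[pair[1]]),
    )
    return ({0, 1, 2} - set(best_pair)).pop()


def agreement(triplets, expert_choices, similarity):
    """Fraction of triplets where the metric's odd one out matches the expert's."""
    if not triplets:
        raise ValueError("need at least one triplet")
    hits = sum(
        odd_one_out(triplet, similarity) == choice
        for triplet, choice in zip(triplets, expert_choices)
    )
    return hits / len(triplets)


# Hypothetical annotations and expert picks (index of the odd item).
triplets = [
    ("student confused about fractions",
     "student struggles with fractions",
     "teacher praises group work"),
    ("ask a probing question",
     "revoice the idea",
     "ask a follow up question"),
]
expert_choices = [2, 0]
print(agreement(triplets, expert_choices, jaccard))
```

Swapping `jaccard` for an embedding-based or LLM-based similarity function leaves the scoring logic unchanged, which is what makes a triplet benchmark convenient for comparing metrics.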
AS · May 20, 2025
From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data
Ahmed Adel Attia, Dorottya Demszky, Jing Liu et al.
Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.
CL · Nov 12, 2024
Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations · Rose E. Wang, Pawan Wirawarn, Kenny Lam et al.
Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education where effectively structuring lessons around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language model (LLM) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.
CL · May 19, 2023
MD3: The Multi-Dialect Dataset of Dialogues · Jacob Eisenstein, Vinodkumar Prabhakaran, Clara Rivera et al.
We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States. The Multi-Dialect Dataset of Dialogues (MD3) strikes a new balance between open-ended conversational speech and task-oriented dialogue by prompting participants to perform a series of short information-sharing tasks. This facilitates quantitative cross-dialectal comparison, while avoiding the imposition of a restrictive task structure that might inhibit the expression of dialect features. Preliminary analysis of the dataset reveals significant differences in syntax and in the use of discourse markers. The dataset, which will be made publicly available with the publication of this paper, includes more than 20 hours of audio and more than 200,000 orthographically-transcribed tokens.
CL · Jun 7, 2021
Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions · Dorottya Demszky, Jing Liu, Zid Mancenido et al.
In conversation, uptake happens when a speaker builds on the contribution of their interlocutor by, for example, acknowledging, repeating or reformulating what they have said. In education, teachers' uptake of student contributions has been linked to higher student achievement. Yet measuring and improving teachers' uptake at scale is challenging, as existing methods require expensive annotation by experts. We propose a framework for computationally measuring uptake, by (1) releasing a dataset of student-teacher exchanges extracted from US math classroom transcripts annotated for uptake by experts; (2) formalizing uptake as pointwise Jensen-Shannon Divergence (pJSD), estimated via next utterance classification; (3) conducting a linguistically-motivated comparison of different unsupervised measures and (4) correlating these measures with educational outcomes. We find that although repetition captures a significant part of uptake, pJSD outperforms repetition-based baselines, as it is capable of identifying a wider range of uptake phenomena like question answering and reformulation. We apply our uptake measure to three different educational datasets with outcome indicators. Unlike baseline measures, pJSD correlates significantly with instruction quality in all three, providing evidence for its generalizability and for its potential to serve as an automated professional development tool for teachers.
CL · Oct 23, 2020
Learning to Recognize Dialect Features · Dorottya Demszky, Devyani Sharma, Jonathan H. Clark et al.
Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in "He {} running". In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.
CL · Jun 16, 2020
The Role of Verb Semantics in Hungarian Verb-Object Order · Dorottya Demszky, László Kálmán, Dan Jurafsky et al.
Hungarian is often referred to as a discourse-configurational language, since the structural position of constituents is determined by their logical function (topic or comment) rather than their grammatical function (e.g., subject or object). We build on work by Komlósy (1989) and argue that in addition to discourse context, the lexical semantics of the verb also plays a significant role in determining Hungarian word order. To investigate this role, we conduct a large-scale, data-driven analysis on the ordering of 380 transitive verbs and their objects, as observed in hundreds of thousands of examples extracted from the Hungarian Gigaword Corpus. We test the effect of lexical semantics on the ordering of verbs and their objects by grouping verbs into 11 semantic classes. In addition to the semantic class of the verb, we also include two control features related to information structure, object definiteness and object NP weight, chosen to allow a comparison of their effect size to that of verb semantics. Our results suggest that all three features have a significant effect on verb-object ordering in Hungarian and among these features, the semantic class of the verb has the largest effect. Specifically, we find that stative verbs, such as fed "cover", jelent "mean" and övez "surround", tend to be OV-preferring (with the exception of psych verbs, which are strongly VO-preferring) and non-stative verbs, such as bírál "judge", csökkent "reduce" and csókol "kiss", tend to be VO-preferring. These findings support our hypothesis that lexical semantic factors influence word order in Hungarian.
CL · May 1, 2020
GoEmotions: A Dataset of Fine-Grained Emotions · Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko et al.
Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.
CL · Apr 2, 2019
Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings · Dorottya Demszky, Nikhil Garg, Rob Voigt et al.
We provide an NLP framework to uncover four linguistic dimensions of political polarization in social media: topic choice, framing, affect and illocutionary force. We quantify these aspects with existing lexical methods, and propose clustering of tweet embeddings as a means to identify salient topics for analysis across events; human evaluations show that our approach generates more cohesive topics than traditional LDA-based models. We apply our methods to study 4.4M tweets on 21 mass shootings. We provide evidence that the discussion of these events is highly polarized politically and that this polarization is primarily driven by partisan differences in framing rather than topic choice. We identify framing devices, such as grounding and the contrasting use of the terms "terrorist" and "crazy", that contribute to polarization. Results pertaining to topic choice, affect and illocutionary force suggest that Republicans focus more on the shooter and event-specific facts (news) while Democrats focus more on the victims and call for policy changes. Our work contributes to a deeper understanding of the way group divisions manifest in language and to computational methods for studying them.
CL · Sep 9, 2018
Transforming Question Answering Datasets Into Natural Language Inference Datasets · Dorottya Demszky, Kelvin Guu, Percy Liang
Existing datasets for natural language inference (NLI) have propelled research on language understanding. We propose a new method for automatically deriving NLI datasets from the growing abundance of large-scale question answering datasets. Our approach hinges on learning a sentence transformation model which converts question-answer pairs into their declarative forms. Despite being primarily trained on a single QA dataset, we show that it can be successfully applied to a variety of other QA resources. Using this system, we automatically derive a new freely available dataset of over 500k NLI examples (QA-NLI), and show that it exhibits a wide range of inference phenomena rarely seen in previous NLI datasets.