CLSep 4, 2022
ArgLegalSumm: Improving Abstractive Summarization of Legal Documents with Argument MiningMohamed Elaraby, Diane Litman
A challenging task when generating summaries of legal documents is the ability to address their argumentative nature. We introduce a simple technique to capture the argumentative structure of legal documents by integrating argument role labeling into the summarization process. Experiments with pretrained language models show that our proposed approach improves performance over strong baselines
CLSep 14, 2022
ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness MiningZhexiong Liu, Meiqi Guo, Yue Dai et al.
The growing interest in developing corpora of persuasive texts has promoted applications in automated systems, e.g., debating and essay scoring systems; however, there is little prior work mining image persuasiveness from an argumentative perspective. To expand persuasiveness mining into a multi-modal realm, we present a multi-modal dataset, ImageArg, consisting of annotations of image persuasiveness in tweets. The annotations are based on a persuasion taxonomy we developed to explore image functionalities and the means of persuasion. We benchmark image persuasiveness tasks on ImageArg using widely-used multi-modal learning methods. The experimental results show that our dataset offers a useful resource for this rich and challenging topic, and there is ample room for modeling improvement.
CLJun 1, 2023
Towards Argument-Aware Abstractive Summarization of Long Legal Opinions with Summary RerankingMohamed Elaraby, Yang Zhong, Diane Litman
We propose a simple approach for the abstractive summarization of long legal opinions that considers the argument structure of the document. Legal opinions often contain complex and nuanced argumentation, making it challenging to generate a concise summary that accurately captures the main points of the legal opinion. Our approach involves using argument role information to generate multiple candidate summaries, then reranking these candidates based on alignment with the document's argument structure. We demonstrate the effectiveness of our approach on a dataset of long legal opinions and show that it outperforms several strong baselines.
CLNov 6, 2022
Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization of Legal Case DecisionsYang Zhong, Diane Litman
Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.
CLJun 3, 2022
ArgRewrite V.2: an Annotated Argumentative Revisions CorpusOmid Kashefi, Tazin Afrin, Meghan Dale et al.
Analyzing how humans revise their writings is an interesting research question, not only from an educational perspective but also in terms of artificial intelligence. Better understanding of this process could facilitate many NLP applications, from intelligent tutoring systems to supportive and collaborative writing environments. Developing these applications, however, requires revision corpora, which are not widely available. In this work, we present ArgRewrite V.2, a corpus of annotated argumentative revisions, collected from two cycles of revisions to argumentative essays about self-driving cars. Annotations are provided at different levels of purpose granularity (coarse and fine) and scope (sentential and subsentential). In addition, the corpus includes the revision goal given to each writer, essay scores, annotation verification, pre- and post-study surveys collected from participants as meta-data. The variety of revision unit scope and purpose granularity levels in ArgRewrite, along with the inclusion of new types of meta-data, can make it a useful resource for research and applications that involve revision analysis. We demonstrate some potential applications of ArgRewrite V.2 in the development of automatic revision purpose predictors, as a training source and benchmark.
CLJun 21, 2023
Utilizing Natural Language Processing for Automated Assessment of Classroom DiscussionNhat Tran, Benjamin Pierce, Diane Litman et al.
Rigorous and interactive class discussions that support students to engage in high-level thinking and reasoning are essential to learning and are a central component of most teaching interventions. However, formally assessing discussion quality 'at scale' is expensive and infeasible for most researchers. In this work, we experimented with various modern natural language processing (NLP) techniques to automatically generate rubric scores for individual dimensions of classroom text discussion quality. Specifically, we worked on a dataset of 90 classroom discussion transcripts consisting of over 18000 turns annotated with fine-grained Analyzing Teaching Moves (ATM) codes and focused on four Instructional Quality Assessment (IQA) rubrics. Despite the limited amount of data, our work shows encouraging results in some of the rubrics while suggesting that there is room for improvement in the others. We also found that certain NLP approaches work better for certain rubrics.
CLJun 1, 2023
Predicting the Quality of Revisions in Argumentative WritingZhexiong Liu, Diane Litman, Elaine Wang et al.
The ability to revise in response to feedback is critical to students' writing success. In the case of argument writing in specific, identifying whether an argument revision (AR) is successful or not is a complex problem because AR quality is dependent on the overall content of an argument. For example, adding the same evidence sentence could strengthen or weaken existing claims in different argument contexts (ACs). To address this issue we developed Chain-of-Thought prompts to facilitate ChatGPT-generated ACs for AR quality predictions. The experiments on two corpora, our annotated elementary essays and existing college essays benchmark, demonstrate the superiority of the proposed ACs over baselines.
CLFeb 10, 2023
Predicting Desirable Revisions of Evidence and Reasoning in Argumentative WritingTazin Afrin, Diane Litman
We develop models to classify desirable evidence and desirable reasoning revisions in student argumentative writing. We explore two ways to improve classifier performance - using the essay context of the revision, and using the feedback students received before the revision. We perform both intrinsic and extrinsic evaluation for each of our models and report a qualitative analysis. Our results show that while a model using feedback information improves over a baseline model, models utilizing context - either alone or with feedback - are the most successful in identifying desirable revisions.
CLSep 23, 2022
Comparison of Lexical Alignment with a Teachable Robot in Human-Robot and Human-Human-Robot InteractionsYuya Asano, Diane Litman, Mingzhi Yu et al.
Speakers build rapport in the process of aligning conversational behaviors with each other. Rapport engendered with a teachable agent while instructing domain material has been shown to promote learning. Past work on lexical alignment in the field of education suffers from limitations in both the measures used to quantify alignment and the types of interactions in which alignment with agents has been studied. In this paper, we apply alignment measures based on a data-driven notion of shared expressions (possibly composed of multiple words) and compare alignment in one-on-one human-robot (H-R) interactions with the H-R portions of collaborative human-human-robot (H-H-R) interactions. We find that students in the H-R setting align with a teachable robot more than in the H-H-R setting and that the relationship between lexical alignment and rapport is more complex than what is predicted by previous theoretical and empirical work.
HCJun 11, 2023
Impact of Experiencing Misrecognition by Teachable Agents on Learning and RapportYuya Asano, Diane Litman, Mingzhi Yu et al.
While speech-enabled teachable agents have some advantages over typing-based ones, they are vulnerable to errors stemming from misrecognition by automatic speech recognition (ASR). These errors may propagate, resulting in unexpected changes in the flow of conversation. We analyzed how such changes are linked with learning gains and learners' rapport with the agents. Our results show they are not related to learning gains or rapport, regardless of the types of responses the agents should have returned given the correct input from learners without ASR errors. We also discuss the implications for optimal error-recovery policies for teachable agents that can be drawn from these findings.
CLSep 13, 2023
Learning from Auxiliary Sources in Argumentative Revision ClassificationTazin Afrin, Diane Litman
We develop models to classify desirable reasoning revisions in argumentative writing. We explore two approaches -- multi-task learning and transfer learning -- to take advantage of auxiliary sources of revision data for similar tasks. Results of intrinsic and extrinsic evaluations show that both approaches can indeed improve classifier performance over baselines. While multi-task learning shows that training on different sources of data at the same time may improve performance, transfer-learning better represents the relationship between the data.
CLOct 15, 2023
Overview of ImageArg-2023: The First Shared Task in Multimodal Argument MiningZhexiong Liu, Mohamed Elaraby, Yang Zhong et al.
This paper presents an overview of the ImageArg shared task, the first multimodal Argument Mining shared task co-located with the 10th Workshop on Argument Mining at EMNLP 2023. The shared task comprises two classification subtasks - (1) Subtask-A: Argument Stance Classification; (2) Subtask-B: Image Persuasiveness Classification. The former determines the stance of a tweet containing an image and a piece of text toward a controversial topic (e.g., gun control and abortion). The latter determines whether the image makes the tweet text more persuasive. The shared task received 31 submissions for Subtask-A and 21 submissions for Subtask-B from 9 different teams across 6 countries. The top submission in Subtask-A achieved an F1-score of 0.8647 while the best submission in Subtask-B achieved an F1-score of 0.5561.
CLSep 29, 2023
STRONG -- Structure Controllable Legal Opinion Summary GenerationYang Zhong, Diane Litman
We propose an approach for the structure controllable summarization of long legal opinions that considers the argument structure of the document. Our approach involves using predicted argument role information to guide the model in generating coherent summaries that follow a provided structure pattern. We demonstrate the effectiveness of our approach on a dataset of legal opinions and show that it outperforms several strong baselines with respect to ROUGE, BERTScore, and structure similarity.
CLJun 20, 2024Code
Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument RankingMohamed Elaraby, Diane Litman, Xiang Lorraine Li et al.
Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.
CLJan 10, 2025
Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AIYuya Asano, Sabit Hassan, Paras Sharma et al.
General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with context and (2) context by phonetic correspondence with ASR hypotheses. Evaluated in home improvement and cooking domains with real-world users, our method improves recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and false positive rate. Users rated .8-1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
CLJan 1, 2025
eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative FeedbackZhexiong Liu, Diane Litman, Elaine Wang et al.
The ability to revise essays in response to feedback is important for students' writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.
CLMar 27, 2024
ReflectSumm: A Benchmark for Course Reflection SummarizationYang Zhong, Mohamed Elaraby, Diane Litman et al.
This paper introduces ReflectSumm, a novel summarization dataset specifically designed for summarizing students' reflective writing. The goal of ReflectSumm is to facilitate developing and evaluating novel summarization techniques tailored to real-world scenarios with little training data, %practical tasks with potential implications in the opinion summarization domain in general and the educational domain in particular. The dataset encompasses a diverse range of summarization tasks and includes comprehensive metadata, enabling the exploration of various research questions and supporting different applications. To showcase its utility, we conducted extensive evaluations using multiple state-of-the-art baselines. The results provide benchmarks for facilitating further research in this area.
CLFeb 10, 2025
Discourse-Driven Evaluation: Unveiling Factual Inconsistency in Long Document SummarizationYang Zhong, Diane Litman
Detecting factual inconsistency for long document summarization remains challenging, given the complex structure of the source article and long summary length. In this work, we study factual inconsistency errors and connect them with a line of discourse analysis. We find that errors are more common in complex sentences and are associated with several discourse features. We propose a framework that decomposes long texts into discourse-inspired chunks and utilizes discourse information to better aggregate sentence-level scores predicted by natural language inference models. Our approach shows improved performance on top of different model baselines over several evaluation benchmarks, covering rich domains of texts, focusing on long document summarization. This underscores the significance of incorporating discourse features in developing models for scoring summaries for long document factual inconsistency.
CLApr 1, 2024
Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR CommunityCasey Kennington, Malihe Alikhani, Heather Pon-Barry et al. · cmu
The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first focused on education, the second on benchmarks, and the third on the modeling of language when it comes to spoken interaction with robots. The three proposals should act as white papers for any researcher to take and build upon.
CLSep 30, 2025
Efficient Layer-wise LLM Fine-tuning for Revision Intention PredictionZhexiong Liu, Diane Litman
Large Language Models (LLMs) have shown extraordinary success across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they often struggle to categorize nuanced texts. One such example is text revision, which involves nuanced edits between pairs of texts. Although simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are exceptionally expensive and scarce in the community. To address this issue, we introduce a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of important LLM layers that are dynamically selected based on their gradient norm distribution, while freezing those of redundant layers. Extensive experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions, while achieving fast convergence, low GPU memory consumption, and effectiveness on small revision corpora.
CLMay 29, 2025
ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMsMohamed Elaraby, Diane Litman
Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns -- specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.
CLJun 12, 2024
Analyzing Large Language Models for Classroom Discussion AssessmentNhat Tran, Benjamin Pierce, Diane Litman et al.
Automatically assessing classroom discussion quality is becoming increasingly feasible with the help of new NLP advancements such as large language models (LLMs). In this work, we examine how the assessment performance of 2 LLMs interacts with 3 factors that may affect performance: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the 2 LLMs. Our results suggest that the 3 aforementioned factors do affect the performance of the tested LLMs and there is a relation between consistency and performance. We recommend a LLM-based assessment approach that has a good balance in terms of predictive performance, computational efficiency, and consistency.
CLSep 17, 2021
Mitigating Data Scarceness through Data Synthesis, Augmentation and Curriculum for Abstractive SummarizationAhmed Magooda, Diane Litman
This paper explores three simple data manipulation techniques (synthesis, augmentation, curriculum) for improving abstractive summarization models without the need for any additional data. We introduce a method of data synthesis with paraphrasing, a data augmentation technique with sample mixing, and curriculum learning with two new difficulty metrics based on specificity and abstractiveness. We conduct experiments to show that these three techniques can help improve abstractive summarization across two summarization models and two different small datasets. Furthermore, we show that these techniques can improve performance when applied in isolation and when combined.
CLSep 17, 2021
Exploring Multitask Learning for Low-Resource AbstractiveSummarizationAhmed Magooda, Mohamed Elaraby, Diane Litman
This paper explores the effect of using multitask learning for abstractive summarization in the context of small training corpora. In particular, we incorporate four different tasks (extractive summarization, language modeling, concept detection, and paraphrase detection) both individually and in combination, with the goal of enhancing the target task of abstractive summarization via multitask learning. We show that for many task combinations, a model trained in a multitask setting outperforms a model trained only for abstractive summarization, with no additional summarization data introduced. Additionally, we do a comprehensive search and find that certain tasks (e.g. paraphrase detection) consistently benefit abstractive summarization, not only when combined with other tasks but also when using different architectures and training corpora.
CLSep 4, 2021
A Neural Network-Based Linguistic Similarity Measure for Entrainment in ConversationsMingzhi Yu, Diane Litman, Shuang Ma et al.
Linguistic entrainment is a phenomenon where people tend to mimic each other in conversation. The core instrument to quantify entrainment is a linguistic similarity measure between conversational partners. Most of the current similarity measures are based on bag-of-words approaches that rely on linguistic markers, ignoring the overall language structure and dialogue context. To address this issue, we propose to use a neural network model to perform the similarity measure for entrainment. Our model is context-aware, and it further leverages a novel component to learn the shared high-level linguistic features across dialogues. We first investigate the effectiveness of our novel component. Then we use the model to perform similarity measure in a corpus-based entrainment analysis. We observe promising results for both evaluation tasks.
HCJul 14, 2021
Effective Interfaces for Student-Driven Revision Sessions for Argumentative WritingTazin Afrin, Omid Kashefi, Christopher Olshefski et al.
We present the design and evaluation of a web-based intelligent writing assistant that helps students recognize their revisions of argumentative essays. To understand how our revision assistant can best support students, we have implemented four versions of our system with differences in the unit span (sentence versus sub-sentence) of revision analysis and the level of feedback provided (none, binary, or detailed revision purpose categorization). We first discuss the design decisions behind relevant components of the system, then analyze the efficacy of the different versions through a Wizard of Oz study with university students. Our results show that while a simple interface with no revision feedback is easier to use, an interface that provides a detailed categorization of sentence-level revisions is the most helpful based on user survey data, as well as the most effective based on improvement in writing outcomes.
CLJul 14, 2021
Annotation and Classification of Evidence and Reasoning Revisions in Argumentative WritingTazin Afrin, Elaine Wang, Diane Litman et al.
Automated writing evaluation systems can improve students' writing insofar as students attend to the feedback provided and revise their essay drafts in ways aligned with such feedback. Existing research on revision of argumentative writing in such systems, however, has focused on the types of revisions students make (e.g., surface vs. content) rather than the extent to which revisions actually respond to the feedback provided and improve the essay. We introduce an annotation scheme to capture the nature of sentence-level revisions of evidence use and reasoning (the `RER' scheme) and apply it to 5th- and 6th-grade students' argumentative essays. We show that reliable manual annotation can be achieved and that revision annotations correlate with a holistic assessment of essay improvement in line with the feedback provided. Furthermore, we explore the feasibility of automatically classifying revisions according to our scheme.
CLMay 27, 2021
Leveraging Linguistic Coordination in Reranking N-Best Candidates For End-to-End Response Selection Using BERTMingzhi Yu, Diane Litman
Retrieval-based dialogue systems select the best response from many candidates. Although many state-of-the-art models have shown promising performance in dialogue response selection tasks, there is still quite a gap between R@1 and R@10 performance. To address this, we propose to leverage linguistic coordination (a phenomenon that individuals tend to develop similar linguistic behaviors in conversation) to rerank the N-best candidates produced by BERT, a state-of-the-art pre-trained language model. Our results show an improvement in R@1 compared to BERT baselines, demonstrating the utility of repairing machine-generated outputs by leveraging a linguistic theory.
CLFeb 20, 2021
Discussion Tracker: Supporting Teacher Learning about Students' Collaborative Argumentation in High School ClassroomsLuca Lugini, Christopher Olshefski, Ravneet Singh et al.
Teaching collaborative argumentation is an advanced skill that many K-12 teachers struggle to develop. To address this, we have developed Discussion Tracker, a classroom discussion analytics system based on novel algorithms for classifying argument moves, specificity, and collaboration. Results from a classroom deployment indicate that teachers found the analytics useful, and that the underlying classifiers perform with moderate to substantial agreement with humans.
CLFeb 20, 2021
Contextual Argument Component Classification for Class DiscussionsLuca Lugini, Diane Litman
Argument mining systems often consider contextual information, i.e. information outside of an argumentative discourse unit, when trained to accomplish tasks such as argument component identification, classification, and relation extraction. However, prior work has not carefully analyzed the utility of different contextual properties in context-aware models. In this work, we show how two different types of contextual information, local discourse context and speaker context, can be incorporated into a computational model for classifying argument components in multi-party classroom discussions. We find that both context types can improve performance, although the improvements are dependent on context size and position.
CLAug 4, 2020
Automated Topical Component Extraction Using Neural Network Attention Scores from Source-based Essay ScoringHaoran Zhang, Diane Litman
While automated essay scoring (AES) can reliably grade essays at scale, automated writing evaluation (AWE) additionally provides formative feedback to guide essay revision. However, a neural AES typically does not provide useful feature representations for supporting AWE. This paper presents a method for linking AWE and neural AES, by extracting Topical Components (TCs) representing evidence from a source text using the intermediate output of attention layers. We evaluate performance using a feature-based AES requiring TCs. Results show that performance is comparable whether using automatically or manually constructed TCs for 1) representing essays as rubric-based features, 2) grading essays.
CLMay 22, 2020
The Discussion Tracker Corpus of Collaborative ArgumentationChristopher Olshefski, Luca Lugini, Ravneet Singh et al.
Although Natural Language Processing (NLP) research on argument mining has advanced considerably in recent years, most studies draw on corpora of asynchronous and written texts, often produced by individuals. Few published corpora of synchronous, multi-party argumentation are available. The Discussion Tracker corpus, collected in American high school English classes, is an annotated dataset of transcripts of spoken, multi-party argumentation. The corpus consists of 29 multi-party discussions of English literature transcribed from 985 minutes of audio. The transcripts were annotated for three dimensions of collaborative argumentation: argument moves (claims, evidence, and explanations), specificity (low, medium, high) and collaboration (e.g., extensions of and disagreements about others' ideas). In addition to providing descriptive statistics on the corpus, we provide performance benchmarks and associated code for predicting each dimension separately, illustrate the use of the multiple annotations in the corpus to improve performance via multi-task learning, and finally discuss other ways the corpus might be used to further NLP research.
CLFeb 9, 2020
Abstractive Summarization for Low Resource Data using Domain Transfer and Data SynthesisAhmed Magooda, Diane Litman
Training abstractive summarization models typically requires large amounts of data, which can be a limitation for many domains. In this paper we explore using domain transfer and data synthesis to improve the performance of recent abstractive summarization methods when applied to small corpora of student reflections. First, we explored whether tuning state of the art model trained on newspaper data could boost performance on student reflection data. Evaluations demonstrated that summaries produced by the tuned model achieved higher ROUGE scores compared to model trained on just student reflection data or just newspaper data. The tuned model also achieved higher scores compared to extractive summarization baselines, and additionally was judged to produce more coherent and readable summaries in human evaluations. Second, we explored whether synthesizing summaries of student data could additionally boost performance. We proposed a template-based model to synthesize new data, which when incorporated into training further increased ROUGE scores. Finally, we showed that combining data synthesis with domain transfer achieved higher ROUGE scores compared to only using one of the two approaches.
CLSep 6, 2019
Annotating Student Talk in Text-based Classroom DiscussionsLuca Lugini, Diane Litman, Amanda Godley et al.
Classroom discussions in English Language Arts have a positive effect on students' reading, writing and reasoning skills. Although prior work has largely focused on teacher talk and student-teacher interactions, we focus on three theoretically-motivated aspects of high-quality student talk: argumentation, specificity, and knowledge domain. We introduce an annotation scheme, then show that the scheme can be used to produce reliable annotations and that the annotations are predictive of discussion quality. We also highlight opportunities provided by our scheme for education and natural language processing research.
CLSep 6, 2019
Argument Component Classification for Classroom DiscussionsLuca Lugini, Diane Litman
This paper focuses on argument component classification for transcribed spoken classroom discussions, with the goal of automatically classifying student utterances into claims, evidence, and warrants. We show that an existing method for argument component classification developed for another educationally-oriented domain performs poorly on our dataset. We then show that feature sets from prior work on argument mining for student essays and online dialogues can be used to improve performance considerably. We also provide a comparison between convolutional neural networks and recurrent neural networks when trained under different conditions to classify argument components in classroom discussions. While neural network models are not always able to outperform a logistic regression model, we were able to gain some useful insights: convolutional networks are more robust than recurrent networks both at the character and at the word level, and specificity information can help boost performance in multi-task training.
CLSep 3, 2019
Predicting Specificity in Classroom DiscussionLuca Lugini, Diane Litman
High quality classroom discussion is important to student development, enhancing abilities to express claims, reason about other students' claims, and retain information for longer periods of time. Previous small-scale studies have shown that one indicator of classroom discussion quality is specificity. In this paper we tackle the problem of predicting specificity for classroom discussions. We propose several methods and feature sets capable of outperforming the state of the art in specificity prediction. Additionally, we provide a set of meaningful, interpretable features that can be used to analyze classroom discussions at a pedagogical level.
CLSep 3, 2019
Annotation and Classification of Sentence-level Revision ImprovementTazin Afrin, Diane Litman
Studies of writing revisions rarely focus on revision quality. To address this issue, we introduce a corpus of between-draft revisions of student argumentative essays, annotated as to whether each revision improves essay quality. We demonstrate a potential usage of our annotations by developing a machine learning model to predict revision improvement. With the goal of expanding training data, we also extract revisions from a dataset edited by expert proofreaders. Our results indicate that blending expert and non-expert revisions increases model performance, with expert data particularly important for predicting low-quality revisions.
CLSep 3, 2019
Identifying Editor Roles in Argumentative Writing from Student Revision HistoriesTazin Afrin, Diane Litman
We present a method for identifying editor roles from students' revision behaviors during argumentative writing. We first develop a method for applying a topic modeling algorithm to identify a set of editor roles from a vocabulary capturing three aspects of student revision behaviors: operation, purpose, and position. We validate the identified roles by showing that modeling the editor roles that students take when revising a paper not only accounts for the variance in revision purposes in our data, but also relates to writing improvement.
CLSep 2, 2019
Identifying Personality Traits Using Overlap Dynamics in Multiparty DialogueMingzhi Yu, Emer Gilmartin, Diane Litman
Research on human spoken language has shown that speech plays an important role in identifying speaker personality traits. In this work, we propose an approach for identifying speaker personality traits using overlap dynamics in multiparty spoken dialogues. We first define a set of novel features representing the overlap dynamics of each speaker. We then investigate the impact of speaker personality traits on these features using ANOVA tests. We find that features of overlap dynamics significantly vary for speakers with different levels of both Extraversion and Conscientiousness. Finally, we find that classifiers using only overlap dynamics features outperform random guessing in identifying Extraversion and Agreeableness, and that the improvements are statistically significant.
CLSep 2, 2019
Investigating the Relationship between Multi-Party Linguistic Entrainment, Team Characteristics, and the Perception of Team Social OutcomesMingzhi Yu, Diane Litman, Susannah Paletz
Multi-party linguistic entrainment refers to the phenomenon that speakers tend to speak more similarly during conversation. We first developed new measures of multi-party entrainment on features describing linguistic style, and then examined the relationship between entrainment and team characteristics in terms of gender composition, team size, and diversity. Next, we predicted the perception of team social outcomes using multi-party linguistic entrainment and team characteristics with a hierarchical regression model. We found that teams with greater gender diversity had higher minimum convergence than teams with less gender diversity. Entrainment contributed significantly to predicting perceived team social outcomes both alone and controlling for team characteristics.
CLAug 6, 2019
Co-Attention Based Neural Network for Source-Dependent Essay ScoringHaoran Zhang, Diane Litman
This paper presents an investigation of using a co-attention based neural network for source-dependent essay scoring. We use a co-attention mechanism to help the model learn the importance of each part of the essay more accurately. Also, this paper shows that the co-attention based neural network model provides reliable score prediction of source-dependent responses. We evaluate our model on two source-dependent response corpora. Results show that our model outperforms the baseline on both corpora. We also show that the attention of the model is similar to the expert opinions with examples.
CLAug 6, 2019
eRevise: Using Natural Language Processing to Provide Formative Feedback on Text Evidence Usage in Student WritingHaoran Zhang, Ahmed Magooda, Diane Litman et al.
Writing a good essay typically involves students revising an initial paper draft after receiving feedback. We present eRevise, a web-based writing and revising environment that uses natural language processing features generated for rubric-based essay scoring to trigger formative feedback messages regarding students' use of evidence in response-to-text writing. By helping students understand the criteria for using text evidence during writing, eRevise empowers students to better revise their paper drafts. In a pilot deployment of eRevise in 7 classrooms spanning grades 5 and 6, the quality of text evidence usage in writing improved after students received formative feedback then engaged in paper revision.
CLAug 6, 2019
Word Embedding for Response-To-Text Assessment of EvidenceHaoran Zhang, Diane Litman
Manually grading the Response to Text Assessment (RTA) is labor intensive. Therefore, an automatic method is being developed for scoring analytical writing when the RTA is administered in large numbers of classrooms. Our long-term goal is to also use this scoring method to provide formative feedback to students and teachers about students' writing quality. As a first step towards this goal, interpretable features for automatically scoring the evidence rubric of the RTA have been developed. In this paper, we present a simple but promising method for improving evidence scoring by employing the word embedding model. We evaluate our method on corpora of responses written by upper elementary students.
CLJul 25, 2018
A Novel ILP Framework for Summarizing Content with High Lexical VarietyWencan Luo, Fei Liu, Zitao Liu et al.
Summarizing content contributed by individuals can be challenging, because people make different lexical choices even when describing the same events. However, there remains a significant need to summarize such content. Examples include the student responses to post-class reflective questions, product reviews, and news articles published by different news agencies related to the same events. High lexical diversity of these documents hinders the system's ability to effectively identify salient content and reduce summary redundancy. In this paper, we overcome this issue by introducing an integer linear programming-based summarization framework. It incorporates a low-rank approximation to the sentence-word co-occurrence matrix to intrinsically group semantically-similar lexical items. We conduct extensive experiments on datasets of student responses, product reviews, and news documents. Our approach compares favorably to a number of extractive baselines as well as a neural abstractive summarization system. The paper finally sheds light on when and why the proposed framework is effective at summarizing content with high lexical variety.
CLMay 25, 2018
An Improved Phrase-based Approach to Annotating and Summarizing Student Course ResponsesWencan Luo, Fei Liu, Diane Litman
Teaching large classes remains a great challenge, primarily because it is difficult to attend to all the student needs in a timely manner. Automatic text summarization systems can be leveraged to summarize the student feedback, submitted immediately after each lecture, but it is left to be discovered what makes a good summary for student responses. In this work we explore a new methodology that effectively extracts summary phrases from the student responses. Each phrase is tagged with the number of students who raise the issue. The phrases are evaluated along two dimensions: with respect to text content, they should be informative and well-formed, measured by the ROUGE metric; additionally, they shall attend to the most pressing student needs, measured by a newly proposed metric. This work is enabled by a phrase-based annotation and highlighting scheme, which is new to the summarization task. The phrase-based framework allows us to summarize the student responses into a set of bullet points and present to the instructor promptly.
CLMay 25, 2018
Automatic Summarization of Student Course FeedbackWencan Luo, Fei Liu, Zitao Liu et al.
Student course feedback is generated daily in both classrooms and online course discussion forums. Traditionally, instructors manually analyze these responses in a costly manner. In this work, we propose a new approach to summarizing student course feedback based on the integer linear programming (ILP) framework. Our approach allows different student responses to share co-occurrence statistics and alleviates sparsity issues. Experimental results on a student feedback corpus show that our approach outperforms a range of baselines in terms of both ROUGE scores and human evaluation.
CLFeb 28, 2017
A Joint Identification Approach for Argumentative Writing RevisionsFan Zhang, Diane Litman
Prior work on revision identification typically uses a pipeline method: revision extraction is first conducted to identify the locations of revisions and revision classification is then conducted on the identified revisions. Such a setting propagates the errors of the revision extraction step to the revision classification step. This paper proposes an approach that identifies the revision location and the revision type jointly to solve the issue of error propagation. It utilizes a sequence representation of revisions and conducts sequence labeling for revision identification. A mutation-based approach is utilized to update identification sequences. Results demonstrate that our proposed approach yields better performance on both revision location extraction and revision type classification compared to a pipeline baseline.
AIDec 3, 2016
Using Discourse Signals for Robust Instructor Intervention PredictionMuthu Kumar Chandrasekaran, Carrie Demmans Epp, Min-Yen Kan et al.
We tackle the prediction of instructor intervention in student posts from discussion forums in Massive Open Online Courses (MOOCs). Our key finding is that using automatically obtained discourse relations improves the prediction of when instructors intervene in student discussions, when compared with a state-of-the-art, feature-rich baseline. Our supervised classifier makes use of an automatic discourse parser which outputs Penn Discourse Treebank (PDTB) tags that represent in-post discourse features. We show PDTB relation-based features increase the robustness of the classifier and complement baseline features in recalling more diverse instructor intervention patterns. In comprehensive experiments over 14 MOOC offerings from several disciplines, the PDTB discourse features improve performance on average. The resultant models are less dependent on domain-specific vocabulary, allowing them to better generalize to new courses.