CL · Jul 28, 2023
Teach Me How to Improve My Argumentation Skills: A Survey on Feedback in Argumentation
Camélia Guerraoui, Paul Reisert, Naoya Inoue et al.
The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are useful for evaluating the quality of an argument, they oftentimes cannot explain why a particular argument is considered poor or not, which makes it difficult to provide constructive feedback to users to strengthen their critical thinking skills. In this survey, we aim to explore the different dimensions of feedback (Richness, Visualization, Interactivity, and Personalization) provided by the current computational models for argumentation, and the possibility of enhancing the power of explanations of such models, ultimately helping learners improve their critical thinking skills.
CY · Apr 11
Morally Programmed LLMs Reshape Human Morality
Pengzhao Lyu, Yeun Joon Kim, Yingyue Luna Luan et al.
As large language models (LLMs) increasingly participate in high-stakes decision-making, a central societal debate has revolved around which moral frameworks (deontological or utilitarian) should guide machine behavior. However, a largely overlooked question is whether the moral principles that humans encode in LLMs could, through repeated interactions, reshape human moral inclinations. We developed two LLMs programmed with either deontological principles (D-LLM) or utilitarian principles (U-LLM) and conducted two pre-registered experiments involving extensive human-LLM interactions, comprising 15,985 total exchanges across the two experiments. Results show that interacting with these morally programmed LLMs systematically shifted human moral inclinations to align with the principles embedded in these systems. These effects remained strong two weeks after the interaction, with only slight decay, suggesting deep internalization rather than superficial agreement. Further, LLM-induced shifts in human moral inclinations translated into meaningful changes in socio-political policy evaluations, shaping how individuals approach contentious social issues. Overall, these results demonstrate that morally programmed LLMs can shape, not merely reflect, human morality, revealing a critical design paradox: embedding moral principles in LLMs not only restricts their behavior but also poses the risk of shaping human morality, raising important ethical and policy questions about who determines which principles intelligent machines should adhere to.
AI · Apr 26
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
Jungmin Choi, Keisuke Sakaguchi, Hiroaki Yamada
Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to the best of our knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free-text format. Our key contribution is the manual evaluation of LLM-generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.
LG · Feb 13, 2025
Channel Dependence, Limited Lookback Windows, and the Simplicity of Datasets: How Biased is Time Series Forecasting?
Ibram Abdelmalak, Kiran Madhusudhanan, Jungmin Choi et al.
Time-series forecasting research has converged on a small set of datasets and a standardized collection of evaluation scenarios. Such standardization is, to a certain extent, needed for comparable research. However, the underlying assumption is that the considered setting is representative of the problem as a whole. In this paper, we challenge this assumption and show that the current scenario gives a strongly biased perspective on the state of time-series forecasting research. More specifically, we show that the current evaluation scenario is heavily biased by the simplicity of the current datasets. We furthermore emphasize that, when the lookback window is properly tuned, current models usually do not need any information flow across channels. However, when using more complex benchmark data, the situation changes: here, modeling channel interactions in a sophisticated manner does improve performance. Furthermore, in this complex evaluation scenario, Crossformer, a method regularly neglected as an important baseline, is the SOTA method for time-series forecasting. Based on this, we present the Fast Channel-dependent Transformer (FaCT), a simplified version of Crossformer which closes the runtime gap between Crossformer and TimeMixer, leading to an efficient model for complex forecasting datasets.
CL · Jun 18, 2024
Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling
Irfan Robbani, Paul Reisert, Naoya Inoue et al.
Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention to explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy's implicit logic. Using our templates, we conduct an annotation study on 400 fallacious arguments taken from the LOGIC dataset and achieve a high agreement score (Krippendorff's alpha of 0.54) and reasonable coverage (0.83). Finally, we conduct an experiment on detecting the structure of fallacies and find that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available.