Jaromir Savelka

CL
h-index32
34papers
2,194citations
Novelty34%
AI Score52

34 Papers

CYOct 1, 2023
The Robots are Here: Navigating the Generative AI Revolution in Computing Education

James Prather, Paul Denny, Juho Leinonen et al. · cmu

Recent advancements in artificial intelligence (AI) are fundamentally reshaping computing, with large language models (LLMs) now effectively being able to generate and interpret source code and natural language instructions. These emergent capabilities have sparked urgent questions in the computing education community around how educators should adapt their pedagogy to address the challenges and to leverage the opportunities presented by this new technology. In this working group report, we undertake a comprehensive exploration of LLMs in the context of computing education and make five significant contributions. First, we provide a detailed review of the literature on LLMs in computing education and synthesise findings from 71 primary articles. Second, we report the findings of a survey of computing students and instructors from across 20 countries, capturing prevailing attitudes towards LLMs and their use in computing education contexts. Third, to understand how pedagogy is already changing, we offer insights collected from in-depth interviews with 22 computing educators from five continents who have already adapted their curricula and assessments. Fourth, we use the ACM Code of Ethics to frame a discussion of ethical issues raised by the use of large language models in computing education, and we provide concrete advice for policy makers, educators, and students. Finally, we benchmark the performance of LLMs on various computing education datasets, and highlight the extent to which the capabilities of current models are rapidly improving. Our aim is that this report will serve as a focal point for both researchers and practitioners who are exploring, adapting, using, and evaluating LLMs and LLM-based tools in computing classrooms.

CLJun 15, 2023
Explaining Legal Concepts with Augmented Large Language Models (GPT-4)

Jaromir Savelka, Kevin D. Ashley, Morgan A. Gray et al. · cmu

Interpreting the meaning of legal open-textured terms is a key task of legal professionals. An important source for this interpretation is how the term was applied in previous court cases. In this paper, we evaluate the performance of GPT-4 in generating factually accurate, clear and relevant explanations of terms in legislation. We compare the performance of a baseline setup, where GPT-4 is directly asked to explain a legal term, to an augmented approach, where a legal information retrieval module is used to provide relevant context to the model, in the form of sentences from case law. We found that the direct application of GPT-4 yields explanations that appear to be of very high quality on their surface. However, detailed analysis uncovered limitations in terms of the factual accuracy of the explanations. Further, we found that the augmentation leads to improved quality, and appears to eliminate the issue of hallucination, where models invent incorrect statements. These findings open the door to the building of systems that can autonomously retrieve relevant sentences from case law and condense them into a useful explanation for legal scholars, educators or practicing lawyers alike.

CYJun 15, 2023
Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses

Jaromir Savelka, Arav Agarwal, Marshall An et al. · cmu

This paper studies recent developments in large language models' (LLM) abilities to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT resulted in heated debates of its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming classes (e.g., cheating). Recent studies show that while the technology performs surprisingly well on diverse sets of assessment instruments employed in typical programming classes the performance is usually not sufficient to pass the courses. The release of GPT-4 largely emphasized notable improvements in the capabilities related to handling assessments originally designed for human test-takers. This study is the necessary analysis in the context of this ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, comparing it to the previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Additionally, we analyze the assessments that were not handled well by GPT-4 to understand the current limitations of the model, as well as its capabilities to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the typical programming class' assessments (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4's handling of MCQs and coding exercises, the rate of improvement across the recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments as well as to fuel the necessary discussions into how programming classes should be updated to reflect the recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which there is an easy-to-use widely accessible technology that can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable programming knowledge and skills assessments.

CLJun 24, 2023
Can GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?

Jaromir Savelka, Kevin D. Ashley, Morgan A Gray et al. · cmu

We evaluated the capability of generative pre-trained transformers~(GPT-4) in analysis of textual data in tasks that require highly specialized domain expertise. Specifically, we focused on the task of analyzing court opinions to interpret legal concepts. We found that GPT-4, prompted with annotation guidelines, performs on par with well-trained law student annotators. We observed that, with a relatively minor decrease in performance, GPT-4 can perform batch predictions leading to significant cost reductions. However, employing chain-of-thought prompting did not lead to noticeably improved performance on this task. Further, we demonstrated how to analyze GPT-4's predictions to identify and mitigate deficiencies in annotation guidelines, and subsequently improve the performance of the model. Finally, we observed that the model is quite brittle, as small formatting related changes in the prompt had a high impact on the predictions. These findings can be leveraged by researchers and practitioners who engage in semantic/pragmatic annotations of texts in the context of the tasks requiring highly specialized domain expertise.

AIMar 16, 2023
Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses?

Jaromir Savelka, Arav Agarwal, Christopher Bogart et al. · cmu

We evaluated the capability of generative pre-trained transformers (GPT), to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. Discussions of potential uses (e.g., exercise generation, code explanation) and misuses (e.g., cheating) of this emerging technology in programming education have intensified, but to date there has not been a rigorous analysis of the models' capabilities in the realistic context of a full-fledged programming course with diverse set of assessment instruments. We evaluated GPT on three Python courses that employ assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed into multiple files (599 exercises overall). Further, we studied if and how successfully GPT models leverage feedback provided by an auto-grader. We found that the current models are not capable of passing the full spectrum of assessments typically involved in a Python programming course (<70% on even entry-level modules). Yet, it is clear that a straightforward application of these easily accessible models could enable a learner to obtain a non-trivial portion of the overall available score (>55%) in introductory and intermediate courses alike. While the models exhibit remarkable capabilities, including correcting solutions based on auto-grader's feedback, some limitations exist (e.g., poor handling of exercises requiring complex chains of reasoning steps). These findings can be leveraged by instructors wishing to adapt their assessments so that GPT becomes a valuable assistant for a learner as opposed to an end-to-end solution.

AIJun 30, 2023
Harnessing LLMs in Curricular Design: Using GPT-4 to Support Authoring of Learning Objectives

Pragnya Sridhar, Aidan Doyle, Arav Agarwal et al. · cmu

We evaluated the capability of a generative pre-trained transformer (GPT-4) to automatically generate high-quality learning objectives (LOs) in the context of a practically oriented university course on Artificial Intelligence. Discussions of opportunities (e.g., content generation, explanation) and risks (e.g., cheating) of this emerging technology in education have intensified, but to date there has not been a study of the models' capabilities in supporting the course design and authoring of LOs. LOs articulate the knowledge and skills learners are intended to acquire by engaging with a course. To be effective, LOs must focus on what students are intended to achieve, focus on specific cognitive processes, and be measurable. Thus, authoring high-quality LOs is a challenging and time consuming (i.e., expensive) effort. We evaluated 127 LOs that were automatically generated based on a carefully crafted prompt (detailed guidelines on high-quality LOs authoring) submitted to GPT-4 for conceptual modules and projects of an AI Practitioner course. We analyzed the generated LOs if they follow certain best practices such as beginning with action verbs from Bloom's taxonomy in regards to the level of sophistication intended. Our analysis showed that the generated LOs are sensible, properly expressed (e.g., starting with an action verb), and that they largely operate at the appropriate level of Bloom's taxonomy, respecting the different nature of the conceptual modules (lower levels) and projects (higher levels). Our results can be leveraged by instructors and curricular designers wishing to take advantage of the state-of-the-art generative models to support their curricular and course design efforts.

CLMar 9, 2023
Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code

Jaromir Savelka, Arav Agarwal, Christopher Bogart et al. · cmu

We analyzed effectiveness of three generative pre-trained transformer (GPT) models in answering multiple-choice question (MCQ) assessments, often involving short snippets of code, from introductory and intermediate programming courses at the postsecondary level. This emerging technology stirs countless discussions of its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming education (e.g., cheating). However, the capabilities of GPT models and their limitations to reason about and/or analyze code in educational settings have been under-explored. We evaluated several OpenAI's GPT models on formative and summative MCQ assessments from three Python courses (530 questions). We found that MCQs containing code snippets are not answered as successfully as those that only contain natural language. While questions requiring to fill-in a blank in the code or completing a natural language statement about the snippet are handled rather successfully, MCQs that require analysis and/or reasoning about the code (e.g., what is true/false about the snippet, or what is its output) appear to be the most challenging. These findings can be leveraged by educators to adapt their instructional practices and assessments in programming courses, so that GPT becomes a valuable assistant for a learner as opposed to a source of confusion and/or potential hindrance in the learning process.

AIOct 28, 2023
Using Large Language Models to Support Thematic Analysis in Empirical Legal Studies

Jakub Drápal, Hannes Westermann, Jaromir Savelka · cmu

Thematic analysis and other variants of inductive coding are widely used qualitative analytic methods within empirical legal studies (ELS). We propose a novel framework facilitating effective collaboration of a legal expert with a large language model (LLM) for generating initial codes (phase 2 of thematic analysis), searching for themes (phase 3), and classifying the data in terms of the themes (to kick-start phase 4). We employed the framework for an analysis of a dataset (n=785) of facts descriptions from criminal court opinions regarding thefts. The goal of the analysis was to discover classes of typical thefts. Our results show that the LLM, namely OpenAI's GPT-4, generated reasonable initial codes, and it was capable of improving the quality of the codes based on expert feedback. They also suggest that the model performed well in zero-shot classification of facts descriptions in terms of the themes. Finally, the themes autonomously discovered by the LLM appear to map fairly well to the themes arrived at by legal experts. These findings can be leveraged by legal researchers to guide their decisions in integrating LLMs into their thematic analyses, as well as other inductive coding projects.

CLNov 1, 2023
From Text to Structure: Using Large Language Models to Support the Development of Legal Expert Systems

Samyar Janatian, Hannes Westermann, Jinzhe Tan et al. · cmu

Encoding legislative text in a formal representation is an important prerequisite to different tasks in the field of AI & Law. For example, rule-based expert systems focused on legislation can support laypeople in understanding how legislation applies to them and provide them with helpful context and information. However, the process of analyzing legislation and other sources to encode it in the desired formal representation can be time-consuming and represents a bottleneck in the development of such systems. Here, we investigate to what degree large language models (LLMs), such as GPT-4, are able to automatically extract structured representations from legislation. We use LLMs to create pathways from legislation, according to the JusticeBot methodology for legal decision support systems, evaluate the pathways and compare them to manually created pathways. The results are promising, with 60% of generated pathways being rated as equivalent or better than manually created ones in a blind comparison. The approach suggests a promising path to leverage the capabilities of LLMs to ease the costly development of systems based on symbolic approaches that are transparent and explainable.

CLOct 24, 2022
Toward an Intelligent Tutoring System for Argument Mining in Legal Texts

Hannes Westermann, Jaromir Savelka, Vern R. Walker et al. · cmu

We propose an adaptive environment (CABINET) to support caselaw analysis (identifying key argument elements) based on a novel cognitive computing framework that carefully matches various machine learning (ML) capabilities to the proficiency of a user. CABINET supports law students in their learning as well as professionals in their work. The results of our experiments focused on the feasibility of the proposed framework are promising. We show that the system is capable of identifying a potential error in the analysis with very low false positives rate (2.0-3.5%), as well as of predicting the key argument element type (e.g., an issue or a holding) with a reasonably high F1-score (0.74).

HCJul 9, 2024
It Cannot Be Right If It Was Written by AI: On Lawyers' Preferences of Documents Perceived as Authored by an LLM vs a Human

Jakub Harasta, Tereza Novotná, Jaromir Savelka · cmu

Large Language Models (LLMs) enable a future in which certain types of legal documents may be generated automatically. This has a great potential to streamline legal processes, lower the cost of legal services, and dramatically increase access to justice. While many researchers focus on proposing and evaluating LLM-based applications supporting tasks in the legal domain, there is a notable lack of investigations into how legal professionals perceive content if they believe an LLM has generated it. Yet, this is a critical point as over-reliance or unfounded scepticism may influence whether such documents bring about appropriate legal consequences. This study is the necessary analysis of the ongoing transition towards mature generative AI systems. Specifically, we examined whether the perception of legal documents' by lawyers and law students (n=75) varies based on their assumed origin (human-crafted vs AI-generated). The participants evaluated the documents, focusing on their correctness and language quality. Our analysis revealed a clear preference for documents perceived as crafted by a human over those believed to be generated by AI. At the same time, most participants expect the future in which documents will be generated automatically. These findings could be leveraged by legal practitioners, policymakers, and legislators to implement and adopt legal document generation technology responsibly and to fuel the necessary discussions on how legal processes should be updated to reflect recent technological developments.

CLJul 27, 2023
LLMediator: GPT-4 Assisted Online Dispute Resolution

Hannes Westermann, Jaromir Savelka, Karim Benyekhlef · cmu

In this article, we introduce LLMediator, an experimental platform designed to enhance online dispute resolution (ODR) by utilizing capabilities of state-of-the-art large language models (LLMs) such as GPT-4. In the context of high-volume, low-intensity legal disputes, alternative dispute resolution methods such as negotiation and mediation offer accessible and cooperative solutions for laypeople. These approaches can be carried out online on ODR platforms. LLMediator aims to improve the efficacy of such processes by leveraging GPT-4 to reformulate user messages, draft mediator responses, and potentially autonomously engage in the discussions. We present and discuss several features of LLMediator and conduct initial qualitative evaluations, demonstrating the potential for LLMs to support ODR and facilitate amicable settlements. The initial proof of concept is promising and opens up avenues for further research in AI-assisted negotiation and mediation.

CYOct 31, 2023
Efficient Classification of Student Help Requests in Programming Courses Using Large Language Models

Jaromir Savelka, Paul Denny, Mark Liffiton et al. · cmu

The accurate classification of student help requests with respect to the type of help being sought can enable the tailoring of effective responses. Automatically classifying such requests is non-trivial, but large language models (LLMs) appear to offer an accessible, cost-effective solution. This study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying help requests from students in an introductory programming class. In zero-shot trials, GPT-3.5 and GPT-4 exhibited comparable performance on most categories, while GPT-4 outperformed GPT-3.5 in classifying sub-categories for requests related to debugging. Fine-tuning the GPT-3.5 model improved its performance to such an extent that it approximated the accuracy and consistency across categories observed between two human raters. Overall, this study demonstrates the feasibility of using LLMs to enhance educational systems through the automated classification of student needs.

CYNov 7, 2025
AI Literacy for Community Colleges: Instructors' Perspectives on Scenario-Based and Interactive Approaches to Teaching AI

Aparna Maya Warrier, Arav Agarwal, Jaromir Savelka et al.

This research category full paper investigates how community college instructors evaluate interactive, no-code AI literacy resources designed for non-STEM learners. As artificial intelligence becomes increasingly integrated into everyday technologies, AI literacy - the ability to evaluate AI systems, communicate with them, and understand their broader impacts - has emerged as a critical skill across disciplines. Yet effective, scalable approaches for teaching these concepts in higher education remain limited, particularly for students outside STEM fields. To address this gap, we developed AI User, an interactive online curriculum that introduces core AI concepts through scenario - based activities set in real - world contexts. This study presents findings from four focus groups with instructors who engaged with AI User materials and participated in structured feedback activities. Thematic analysis revealed that instructors valued exploratory tasks that simulated real - world AI use cases and fostered experimentation, while also identifying challenges related to scaffolding, accessibility, and multi-modal support. A ranking task for instructional support materials showed a strong preference for interactive demonstrations over traditional educational materials like conceptual guides or lecture slides. These findings offer insights into instructor perspectives on making AI concepts more accessible and relevant for broad learner audiences. They also inform the design of AI literacy tools that align with diverse teaching contexts and support critical engagement with AI in higher education.

MMNov 15, 2025
Can LLMs Create Legally Relevant Summaries and Analyses of Videos?

Lyra Hoeben-Kuil, Gijs van Dijck, Jaromir Savelka et al.

Understanding the legally relevant factual basis of an event and conveying it through text is a key skill of legal professionals. This skill is important for preparing forms (e.g., insurance claims) or other legal documents (e.g., court claims), but often presents a challenge for laypeople. Current AI approaches aim to bridge this gap, but mostly rely on the user to articulate what has happened in text, which may be challenging for many. Here, we investigate the capability of large language models (LLMs) to understand and summarize events occurring in videos. We ask an LLM to summarize and draft legal letters, based on 120 YouTube videos showing legal issues in various domains. Overall, 71.7\% of the summaries were rated as of high or medium quality, which is a promising result, opening the door to a number of applications in e.g. access to justice.

CYNov 7, 2025
AI Literacy Assessment Revisited: A Task-Oriented Approach Aligned with Real-world Occupations

Christopher Bogart, Aparna Warrier, Arav Agarwal et al.

As artificial intelligence (AI) systems become ubiquitous in professional contexts, there is an urgent need to equip workers, often with backgrounds outside of STEM, with the skills to use these tools effectively as well as responsibly, that is, to be AI literate. However, prevailing definitions and therefore assessments of AI literacy often emphasize foundational technical knowledge, such as programming, mathematics, and statistics, over practical knowledge such as interpreting model outputs, selecting tools, or identifying ethical concerns. This leaves a noticeable gap in assessing someone's AI literacy for real-world job use. We propose a work-task-oriented assessment model for AI literacy which is grounded in the competencies required for effective use of AI tools in professional settings. We describe the development of a novel AI literacy assessment instrument, and accompanying formative assessments, in the context of a US Navy robotics training program. The program included training in robotics and AI literacy, as well as a competition with practical tasks and a multiple choice scenario task meant to simulate use of AI in a job setting. We found that, as a measure of applied AI literacy, the competition's scenario task outperformed the tests we adopted from past research or developed ourselves. We argue that when training people for AI-related work, educators should consider evaluating them with instruments that emphasize highly contextualized practical skills rather than abstract technical knowledge, especially when preparing workers without technical backgrounds for AI-integrated roles.

CYNov 7, 2025
"I Like That You Have to Poke Around": Instructors on How Experiential Approaches to AI Literacy Spark Inquiry and Critical Thinking

Aparna Maya Warrier, Arav Agarwal, Jaromir Savelka et al.

As artificial intelligence (AI) increasingly shapes decision-making across domains, there is a growing need to support AI literacy among learners beyond computer science. However, many current approaches rely on programming-heavy tools or abstract lecture-based content, limiting accessibility for non-STEM audiences. This paper presents findings from a study of AI User, a modular, web-based curriculum that teaches core AI concepts through interactive, no-code projects grounded in real-world scenarios. The curriculum includes eight projects; this study focuses on instructor feedback on Projects 5-8, which address applied topics such as natural language processing, computer vision, decision support, and responsible AI. Fifteen community college instructors participated in structured focus groups, completing the projects as learners and providing feedback through individual reflection and group discussion. Using thematic analysis, we examined how instructors evaluated the design, instructional value, and classroom applicability of these experiential activities. Findings highlight instructors' appreciation for exploratory tasks, role-based simulations, and real-world relevance, while also surfacing design trade-offs around cognitive load, guidance, and adaptability for diverse learners. This work extends prior research on AI literacy by centering instructor perspectives on teaching complex AI topics without code. It offers actionable insights for designing inclusive, experiential AI learning resources that scale across disciplines and learner backgrounds.

CLMay 16
Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

Li Zhang, Jaromir Savelka, Kevin Ashley

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only (N=100) training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.

CYDec 5, 2023
A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education

Jacob Doughty, Zipiao Wan, Anishka Bompelli et al. · cmu

There is a constant need for educators to develop and maintain effective up-to-date assessments. While there is a growing body of research in computing education on utilizing large language models (LLMs) in generation and engagement with coding exercises, the use of LLMs for generating programming MCQs has not been extensively explored. We analyzed the capability of GPT-4 to produce multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education. Specifically, we developed an LLM-powered (GPT-4) system for generation of MCQs from high-level course context and module-level LOs. We evaluated 651 LLM-generated and 449 human-crafted MCQs aligned to 246 LOs from 6 Python courses. We found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. We also observed that the generated MCQs appeared to be well-aligned with the LOs. Our findings can be leveraged by educators wishing to take advantage of the state-of-the-art generative models to support MCQ authoring efforts.

CYDec 19, 2024
Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools

James Prather, Juho Leinonen, Natalie Kiesler et al. · cmu

Generative AI (GenAI) is advancing rapidly, and the literature in computing education is expanding almost as quickly. Initial responses to GenAI tools were mixed between panic and utopian optimism. Many were fast to point out the opportunities and challenges of GenAI. Researchers reported that these new tools are capable of solving most introductory programming tasks and are causing disruptions throughout the curriculum. These tools can write and explain code, enhance error messages, create resources for instructors, and even provide feedback and help for students like a traditional teaching assistant. In 2024, new research started to emerge on the effects of GenAI usage in the computing classroom. These new data involve the use of GenAI to support classroom instruction at scale and to teach students how to code with GenAI. In support of the former, a new class of tools is emerging that can provide personalized feedback to students on their programming assignments or teach both programming and prompting skills at the same time. With the literature expanding so rapidly, this report aims to summarize and explain what is happening on the ground in computing classrooms. We provide a systematic literature review; a survey of educators and industry professionals; and interviews with educators using GenAI in their courses, educators studying GenAI, and researchers who create GenAI tools to support computing education. The triangulation of these methods and data sources expands the understanding of GenAI usage and perceptions at this critical moment for our community.

CLApr 14, 2024
Understanding the Role of Temperature in Diverse Question Generation by GPT-4

Arav Agarwal, Karthik Mittal, Aidan Doyle et al. · cmu

We conduct a preliminary study of the effect of GPT's temperature parameter on the diversity of GPT4-generated questions. We find that using higher temperature values leads to significantly higher diversity, with different temperatures exposing different types of similarity between generated sets of questions. We also demonstrate that diverse question generation is especially difficult for questions targeting lower levels of Bloom's Taxonomy.

CLNov 7, 2025
What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

Klára Bendová, Tomáš Knap, Jan Černý et al.

Criminal justice administrative data contain only a limited amount of information about the committed offense. However, there is an unused source of extensive information in continental European courts' decisions: descriptions of criminal behaviors in verdicts by which offenders are found guilty. In this paper, we study the feasibility of extracting these descriptions from publicly available court decisions from Slovakia. We use two different approaches for retrieval: regular expressions and large language models (LLMs). Our baseline was a simple method employing regular expressions to identify typical words occurring before and after the description. The advanced regular expression approach further focused on "sparing" and its normalization (insertion of spaces between individual letters), typical for delineating the description. The LLM approach involved prompting the Gemini Flash 2.0 model to extract the descriptions using predefined instructions. Although the baseline identified descriptions in only 40.5% of verdicts, both methods significantly outperformed it, achieving 97% with advanced regular expressions and 98.75% with LLMs, and 99.5% when combined. Evaluation by law students showed that both advanced methods matched human annotations in about 90% of cases, compared to just 34.5% for the baseline. LLMs fully matched human-labeled descriptions in 91.75% of instances, and a combination of advanced regular expressions with LLMs reached 92%.

CYJan 17, 2025
AI Technicians: Developing Rapid Occupational Training Methods for a Competitive AI Workforce

Jaromir Savelka, Can Kultur, Arav Agarwal et al. · cmu

The accelerating pace of developments in Artificial Intelligence~(AI) and the increasing role that technology plays in society necessitates substantial changes in the structure of the workforce. Besides scientists and engineers, there is a need for a very large workforce of competent AI technicians (i.e., maintainers, integrators) and users~(i.e., operators). As traditional 4-year and 2-year degree-based education cannot fill this quickly opening gap, alternative training methods have to be developed. We present the results of the first four years of the AI Technicians program which is a unique collaboration between the U.S. Army's Artificial Intelligence Integration Center (AI2C) and Carnegie Mellon University to design, implement and evaluate novel rapid occupational training methods to create a competitive AI workforce at the technicians level. Through this multi-year effort we have already trained 59 AI Technicians. A key observation is that ongoing frequent updates to the training are necessary as the adoption of AI in the U.S. Army and within the society at large is evolving rapidly. A tight collaboration among the stakeholders from the army and the university is essential for successful development and maintenance of the training for the evolving role. Our findings can be leveraged by large organizations that face the challenge of developing a competent AI workforce as well as educators and researchers engaged in solving the challenge.

CYApr 13
MCQ Difficulty Prediction via Modeling Learner Heterogeneity Using Data-Driven Cognitive Profiling

Dhriti Krishnan, Jaromir Savelka

Predicting the difficulty of multiple-choice questions (MCQs) is important for effective assessment, yet current methods typically assume a unimodal student ability distribution, overlooking the heterogeneous nature of student misconceptions. We propose a persona-driven framework that replaces theoretical ability sampling with data-driven cognitive profiling. Using student interactions from the EEDI dataset, we identify behavioral personas via latent class analysis (LCA), then condition a large language model (LLM) to simulate response distributions for each persona. These signals are aggregated with topic context and fed into a Ridge Regression model to predict the item response theory (IRT) difficulty parameter. With five-fold cross-validation, our method improves over a recent baseline (MSE: 0.367 to 0.274; R2: 0.525 to 0.686). The discovered personas are interpretable and offer insights into why items are difficult, with potential applications to diagnostic assessment design.

CLMay 31, 2025
Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments

Li Zhang, Morgan Gray, Jaromir Savelka et al. · cmu

Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 & 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Link: https://lizhang-aiandlaw.github.io/An-Automated-Pipeline-for-Evaluating-LLM-Generated-3-ply-Case-Based-Legal-Arguments/

CLDec 16, 2024
Analyzing Images of Legal Documents: Toward Multi-Modal LLMs for Access to Justice

Hannes Westermann, Jaromir Savelka · cmu

Interacting with the legal system and the government requires the assembly and analysis of various pieces of information that can be spread across different (paper) documents, such as forms, certificates and contracts (e.g. leases). This information is required in order to understand one's legal rights, as well as to fill out forms to file claims in court or obtain government benefits. However, finding the right information, locating the correct forms and filling them out can be challenging for laypeople. Large language models (LLMs) have emerged as a powerful technology that has the potential to address this gap, but still rely on the user to provide the correct information, which may be challenging and error-prone if the information is only available in complex paper documents. We present an investigation into utilizing multi-modal LLMs to analyze images of handwritten paper forms, in order to automatically extract relevant information in a structured format. Our initial results are promising, but reveal some limitations (e.g., when the image quality is low). Our work demonstrates the potential of integrating multi-modal LLMs to support laypeople and self-represented litigants in finding and assembling relevant information.

CLOct 23, 2025
Do LLMs Truly Understand When a Precedent Is Overruled?

Li Zhang, Jaromir Savelka, Kevin Ashley · cmu

Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks remains a significant challenge in the field, as most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and commonly found in judicial opinions. They provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity -- the models show degraded performance on historical cases compared to modern ones, revealing fundamental temporal bias in their training; (2) shallow reasoning -- models rely on shallow logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures -- models produce temporally impossible relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses the critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.

CLMay 8, 2023
Unlocking Practical Applications in Legal Domain: Evaluation of GPT for Zero-Shot Semantic Annotation of Legal Texts

Jaromir Savelka

We evaluated the capability of a state-of-the-art generative pre-trained transformer (GPT) model to perform semantic annotation of short text snippets (one to few sentences) coming from legal documents of various types. Discussions of potential uses (e.g., document drafting, summarization) of this emerging technology in legal domain have intensified, but to date there has not been a rigorous analysis of these large language models' (LLM) capacity in sentence-level semantic annotation of legal texts in zero-shot learning settings. Yet, this particular type of use could unlock many practical applications (e.g., in contract review) and research opportunities (e.g., in empirical legal studies). We fill the gap with this study. We examined if and how successfully the model can semantically annotate small batches of short text snippets (10-50) based exclusively on concise definitions of the semantic types. We found that the GPT model performs surprisingly well in zero-shot settings on diverse types of documents (F1=.73 on a task involving court opinions, .86 for contracts, and .54 for statutes and regulations). These findings can be leveraged by legal scholars and practicing lawyers alike to guide their decisions in integrating LLMs in wide range of workflows involving semantic annotation of legal texts.

LGJan 17, 2022
Data-Centric Machine Learning in the Legal Domain

Hannes Westermann, Jaromir Savelka, Vern R. Walker et al.

Machine learning research typically starts with a fixed data set created early in the process. The focus of the experiments is finding a model and training procedure that result in the best possible performance in terms of some selected evaluation metric. This paper explores how changes in a data set influence the measured performance of a model. Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance of a trained deep learning classifier. We assess the overall performance (weighted average) as well as the per-class performance. The observed effects are surprisingly pronounced, especially when the per-class performance is considered. We investigate how "semantic homogeneity" of a class, i.e., the proximity of sentences in a semantic embedding space, influences the difficulty of its classification. The presented results have far reaching implications for efforts related to data collection and curation in the field of AI & Law. The results also indicate that enhancements to a data set could be considered, alongside the advancement of the ML models, as an additional path for increasing classification performance on various tasks in AI & Law. Finally, we discuss the need for an established methodology to assess the potential effects of data set properties.

CLDec 21, 2021
Sentence Embeddings and High-speed Similarity Search for Fast Computer Assisted Annotation of Legal Documents

Hannes Westermann, Jaromir Savelka, Vern R. Walker et al.

Human-performed annotation of sentences in legal documents is an important prerequisite to many machine learning based systems supporting legal tasks. Typically, the annotation is done sequentially, sentence by sentence, which is often time consuming and, hence, expensive. In this paper, we introduce a proof-of-concept system for annotating sentences "laterally." The approach is based on the observation that sentences that are similar in meaning often have the same label in terms of a particular type system. We use this observation in allowing annotators to quickly view and annotate sentences that are semantically similar to a given sentence, across an entire corpus of documents. Here, we present the interface of the system and empirically evaluate the approach. The experiments show that lateral annotation has the potential to make the annotation process quicker and more consistent.

CLDec 15, 2021
Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains

Jaromir Savelka, Hannes Westermann, Karim Benyekhlef et al.

In this paper, we examine the use of multi-lingual sentence embeddings to transfer predictive models for functional segmentation of adjudicatory decisions across jurisdictions, legal systems (common and civil law), languages, and domains (i.e. contexts). Mechanisms for utilizing linguistic resources outside of their original context have significant potential benefits in AI & Law because differences between legal systems, languages, or traditions often block wider adoption of research outcomes. We analyze the use of Language-Agnostic Sentence Representations in sequence labeling models using Gated Recurrent Units (GRUs) that are transferable across languages. To investigate transfer between different contexts we developed an annotation scheme for functional segmentation of adjudicatory decisions. We found that models generalize beyond the contexts on which they were trained (e.g., a model trained on administrative decisions from the US can be applied to criminal law decisions from Italy). Further, we found that training the models on multiple contexts increases robustness and improves overall performance when evaluating on previously unseen contexts. Finally, we found that pooling the training data from all the contexts enhances the models' in-context performance.

CLDec 15, 2021
Cross-Domain Generalization and Knowledge Transfer in Transformers Trained on Legal Data

Jaromir Savelka, Hannes Westermann, Karim Benyekhlef

We analyze the ability of pre-trained language models to transfer knowledge among datasets annotated with different type systems and to generalize beyond the domain and dataset they were trained on. We create a meta task, over multiple datasets focused on the prediction of rhetorical roles. Prediction of the rhetorical role a sentence plays in a case decision is an important and often studied task in AI & Law. Typically, it requires the annotation of a large number of sentences to train a model, which can be time-consuming and expensive. Further, the application of the models is restrained to the same dataset it was trained on. We fine-tune language models and evaluate their performance across datasets, to investigate the models' ability to generalize across domains. Our results suggest that the approach could be helpful in overcoming the cold-start problem in active or interactvie learning, and shows the ability of the models to generalize across datasets and domains.

CLDec 14, 2021
Discovering Explanatory Sentences in Legal Case Decisions Using Pre-trained Language Models

Jaromir Savelka, Kevin D. Ashley

Legal texts routinely use concepts that are difficult to understand. Lawyers elaborate on the meaning of such concepts by, among other things, carefully investigating how have they been used in past. Finding text snippets that mention a particular concept in a useful way is tedious, time-consuming, and, hence, expensive. We assembled a data set of 26,959 sentences, coming from legal case decisions, and labeled them in terms of their usefulness for explaining selected legal concepts. Using the dataset we study the effectiveness of transformer-based models pre-trained on large language corpora to detect which of the sentences are useful. In light of models' predictions, we analyze various linguistic properties of the explanatory sentences as well as their relationship to the legal concept that needs to be explained. We show that the transformer-based models are capable of learning surprisingly sophisticated features and outperform the prior approaches to the task.

LGDec 10, 2021
Computer-Assisted Creation of Boolean Search Rules for Text Classification in the Legal Domain

Hannes Westermann, Jaromir Savelka, Vern R. Walker et al.

In this paper, we present a method of building strong, explainable classifiers in the form of Boolean search rules. We developed an interactive environment called CASE (Computer Assisted Semantic Exploration) which exploits word co-occurrence to guide human annotators in selection of relevant search terms. The system seamlessly facilitates iterative evaluation and improvement of the classification rules. The process enables the human annotators to leverage the benefits of statistical information while incorporating their expert intuition into the creation of such rules. We evaluate classifiers created with our CASE system on 4 datasets, and compare the results to machine learning methods, including SKOPE rules, Random forest, Support Vector Machine, and fastText classifiers. The results drive the discussion on trade-offs between superior compactness, simplicity, and intuitiveness of the Boolean search rules versus the better performance of state-of-the-art machine learning models for text classification.