CLMar 11, 2024Code
On the Consideration of AI Openness: Can Good Intent Be Abused?Yeeun Kim, Hyunseo Shin, Eunkyung Choi et al.
Open source is a driving force behind scientific advancement.However, this openness is also a double-edged sword, with the inherent risk that innovative technologies can be misused for purposes harmful to society. What is the likelihood that an open source AI model or dataset will be used to commit a real-world crime, and if a criminal does exploit it, will the people behind the technology be able to escape legal liability? To address these questions, we explore a legal domain where individual choices can have a significant impact on society. Specifically, we build the EVE-V1 dataset that comprises 200 question-answer pairs related to criminal offenses based on 200 Korean precedents first to explore the possibility of malicious models emerging. We further developed EVE-V2 using 600 fraud-related precedents to confirm the existence of malicious models that can provide harmful advice on a wide range of criminal topics to test the domain generalization ability. Remarkably, widely used open-source large-scale language models (LLMs) provide unethical and detailed information about criminal activities when fine-tuned with EVE. We also take an in-depth look at the legal issues that malicious language models and their builders could realistically face. Our findings highlight the paradoxical dilemma that open source accelerates scientific progress, but requires great care to minimize the potential for misuse. Warning: This paper contains content that some may find unethical.
CLApr 2, 2025Code
LRAGE: Legal Retrieval Augmented Generation Evaluation ToolMinhu Park, Hongseok Oh, Eunkyung Choi et al.
Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benches including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at https://github.com/hoorangyee/LRAGE.
CLMar 5, 2025Code
Taxation Perspectives from Large Language Models: A Case Study on Additional Tax PenaltiesEunkyung Choi, Young Jin Suh, Hun Park et al.
How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain in general, research dedicated to taxation remain scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect the real-world complexities, or unavailable as open source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT is constructed to evaluate LLMs' understanding of tax law, particularly in cases where resolving the issue requires more than just applying related statutes. Our experiments with six LLMs reveal that their baseline capabilities are limited, especially when dealing with conflicting issues that demand a comprehensive understanding. However, we found that enabling retrieval, self-reasoning, and discussion among multiple agents with specific role assignments, this limitation can be mitigated.
CLOct 11, 2024
Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language ModelsYeeun Kim, Young Rok Choi, Eunkyung Choi et al.
Large language models (LLMs) have demonstrated remarkable performance in the legal domain, with GPT-4 even passing the Uniform Bar Exam in the U.S. However their efficacy remains limited for non-standardized tasks and tasks in languages other than English. This underscores the need for careful evaluation of LLMs within each legal system before application. Here, we introduce KBL, a benchmark for assessing the Korean legal language understanding of LLMs, consisting of (1) 7 legal knowledge tasks (510 examples), (2) 4 legal reasoning tasks (288 examples), and (3) the Korean bar exam (4 domains, 53 tasks, 2,510 examples). First two datasets were developed in close collaboration with lawyers to evaluate LLMs in practical scenarios in a certified manner. Furthermore, considering legal practitioners' frequent use of extensive legal documents for research, we assess LLMs in both a closed book setting, where they rely solely on internal knowledge, and a retrieval-augmented generation (RAG) setting, using a corpus of Korean statutes and precedents. The results indicate substantial room and opportunities for improvement.