DL AI CL · Apr 4, 2024

Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning

Joan Giner-Miguelez, Abel Gómez, Jordi Cabot

arXiv:2404.15320v23.38 citations · h-index: 6 · Has Code

Originality: Synthesis-oriented

AI Analysis

The approach helps data publishers and practitioners create machine-readable documentation, which can improve dataset discoverability and compliance with AI regulations. The contribution is incremental, since it applies existing LLMs to a new task.

The paper tackles the problem of unstructured dataset documentation by using large language models (LLMs) to automatically extract key dimensions like provenance and social concerns, achieving accuracies of 81.21% with GPT3.5 and 69.13% with Flan-UL2.

Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as the provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering its automated analysis and processing. In this work, we explore using large language models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could aid data publishers and practitioners in creating machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them. In this paper, we evaluate the approach on 12 scientific dataset papers published in two scientific journals (Nature's Scientific Data and Elsevier's Data in Brief) using two different LLMs (GPT3.5 and Flan-UL2). Results show good accuracy with our prompt extraction strategies. Concrete results vary depending on the dimensions, but overall, GPT3.5 shows slightly better accuracy (81.21%) than Flan-UL2 (69.13%), although it is more prone to hallucinations. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results, in an open-source repository.
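The extraction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or their prompts: the dimension names, the question wording, the "Not stated" fallback, and the helper names (`build_prompt`, `extract_dimensions`, `accuracy`, `fake_llm`) are all assumptions made for illustration, and the LLM is passed in as a plain callable so that no particular provider API is implied.

```python
from typing import Callable, Dict, List

# Hypothetical dimensions and questions, chosen for illustration only.
# The paper's actual dimensions and prompts are in its replication package.
DIMENSION_QUESTIONS: Dict[str, str] = {
    "provenance": "Who collected the data, and how was it gathered?",
    "social_concerns": "Does the text mention bias, privacy, or sensitivity issues?",
}


def build_prompt(question: str, document: str) -> str:
    """Build one extraction prompt that restricts the answer to the document."""
    return (
        "Answer using only the document below. "
        "If the document does not say, reply 'Not stated'.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"
    )


def extract_dimensions(document: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask one question per dimension and collect the stripped answers."""
    return {
        name: llm(build_prompt(question, document)).strip()
        for name, question in DIMENSION_QUESTIONS.items()
    }


def accuracy(judgements: List[bool]) -> float:
    """Percentage of extractions that a reviewer marked as correct."""
    if not judgements:
        return 0.0
    return 100.0 * sum(judgements) / len(judgements)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: any str -> str callable)."""
    return " Not stated "


if __name__ == "__main__":
    print(extract_dimensions("A short dataset description.", fake_llm))
    print(accuracy([True, True, False, True]))
```

Passing the model as a callable keeps the sketch independent of GPT3.5, Flan-UL2, or any client library; the explicit "Not stated" instruction is one simple way to discourage answers that are not grounded in the document, which relates to the hallucination issue the abstract mentions.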
